Introduction

Antiestrogens  have  been  successfully  used  in  the  management  of  breast  cancer  since  the  first 
clinical  trial  of  Tamoxifen  (TAM)  in  1971  [1].  TAM  produces  a  significant  increase  in  both 
overall  and  recurrence-free  survival  but  resistance  almost  inevitably  arises  in  most  patients  [2,3]. 
We  hypothesize  that  one  form  of  acquired  antiestrogen  resistance  reflects  the  altered  expression 
of  what  were  previously  estrogen-regulated  genes.  We  further  hypothesize  that  only  a  subset  of 
all  estrogen  (E2)-regulated  genes,  those  comprising  a  specific  gene  network,  is  responsible  for 
the  resistance  phenotype.  Since  TAM  (triphenylethylene)  and  ICI  182,780  (steroidal)  induce 
different  ER  conformations,  we  also  hypothesize  that  the  consequent  patterns  of  gene  regulation 
will  be  different  and  dictate  the  presence/absence  of  crossresistance  among  antiestrogens. 

To address these hypotheses, we have generated novel E2-independent and antiestrogen
resistant variants of the E2-dependent MCF-7 human breast cancer cell line (MCF7/MIII,
MCF7/LCC1, MCF7/LCC2, MCF7/LCC9), recently reviewed in [4]. We also have assembled
a panel of additional resistant cells from within this institution and from other investigators.
These include additional antiestrogen resistant MCF-7 variants (LY2, R27, R3, MCF-7RR), all
of which express ER, and the ER-negative ZR-75-1 (ZR75/LCC3, ZR-75-9a1) and T47D
(T47Dco) variants. Other resistance models are currently being obtained from other laboratories
or being generated by in vivo selection against TAM in athymic nude rats (rats and
humans perceive TAM as a partial agonist, whereas mice perceive TAM as a pure agonist).

This  is  an  Idea  Award  to  study  the  genes  and  patterns  of  genes  expressed  in  acquired 
antiestrogen  resistance  in  cell  culture  models.  The  PI  will  apply  new,  state-of-the-art 
technologies  to  identify  key  endocrine-regulated  molecular  pathways  to  apoptosis/proliferation. 
By  identifying  key  components  of  these  pathways,  we  may  be  able  to  predict  response  to  first- 
line  and  crossover  antiestrogenic  therapies,  and/or  provide  novel  therapeutic  strategies  for 
antiestrogen  resistant  tumors. 


Body  of  Text 

Our  purpose  is  to  evaluate  a  series  of  antiestrogen  responsive  and  resistant  breast  cancer  cell 
lines  for  their  patterns  of  gene  expression.  We  will  explore  these  data,  using  state-of-the-art 
clustering  pattern  analysis  through  joint  use  of  the  standard  Finite  Normal  Mixture  models  and 
probabilistic  component  subspaces,  where  the  multimodal  clusters  will  be  automatically 
identified  using  Akaike  information  criterion  and  Minimal  Description  Length  analyses.  We  also 
will  apply  the  more  computationally  simplistic  methods  used  by  others  in  the  field. 
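The model-selection step described above can be illustrated with a minimal sketch. It assumes one-dimensional data, a plain EM fit, and the textbook forms AIC = -2 log L + 2p and MDL = -log L + (p/2) log n; the function names (`fit_mixture`, `selection_scores`) and the deterministic starting split are illustrative assumptions, and the probabilistic component subspaces used in the actual analyses are not modeled here.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a 1-D normal distribution."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_mixture(data, k, n_iter=100, var_floor=1e-6):
    """Fit a k-component 1-D normal mixture by EM and return its log-likelihood."""
    xs = sorted(data)
    n = len(xs)
    if n < k:
        raise ValueError("need at least k data points")
    # Deterministic start (an assumption): split sorted values into k chunks.
    chunks = [xs[i * n // k:(i + 1) * n // k] for i in range(k)]
    weights = [len(c) / n for c in chunks]
    means = [sum(c) / len(c) for c in chunks]
    variances = [max(sum((x - m) ** 2 for x in c) / len(c), var_floor)
                 for c, m in zip(chunks, means)]

    def weighted_densities(x):
        return [w * normal_pdf(x, m, v)
                for w, m, v in zip(weights, means, variances)]

    for _ in range(n_iter):
        # E-step: posterior probability of each component for each point.
        resp = []
        for x in xs:
            dens = weighted_densities(x)
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means, and variances in place.
        for j in range(k):
            nj = sum(r[j] for r in resp) or 1e-300
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            variances[j] = max(
                sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj,
                var_floor)
    return sum(math.log(sum(weighted_densities(x)) or 1e-300) for x in xs)


def selection_scores(data, k):
    """Return (AIC, MDL) for a k-component fit; lower is better for both."""
    loglik = fit_mixture(data, k)
    n_params = 3 * k - 1  # k means, k variances, k - 1 free mixing weights
    aic = -2.0 * loglik + 2.0 * n_params
    mdl = -loglik + 0.5 * n_params * math.log(len(data))
    return aic, mdl
```

Evaluating `selection_scores` over a range of k and keeping the k with the lowest score is one simple way to let the criteria choose the number of clusters; the two criteria can disagree, since MDL penalizes extra components more heavily as the sample size grows.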

In  our  previous  report,  we  made  one  change  to  the  specific  aims  and  Statement  of  Work. 
Our  collaborations  with  Dr.  Wang's  group  at  Catholic  University  of  America  have  increased 
substantially,  and  we  have  begun  to  develop  and  test  several  new  algorithms  for  mining  the  high 
dimensional data sets produced by gene expression microarray analyses.

Specific Aims

Specific  Aim  1:  use  gene  microarrays  to  identify  differentially  expressed  genes  in  a  panel  of 
breast  cancer  cell  lines. 

Specific  Aim  2:  explore  the  data  from  Aim  1  to  identify  those  differentially  expressed  gene 
clusters  most  closely  associated  with  acquired  antiestrogen  resistance  and  test  further  novel 
algorithms  for  the  analysis  of  gene  expression  microarray  data. 

Specific  Aim  3:  begin  to  assess  the  likely  functional  relevance  of  representative  members  of 
these  clusters  and  study  their  expression  in  human  breast  cancer  biopsies. 

Long term aims: establish a pattern(s) of gene clusters that can predict antiestrogen responses in
patients.  This  could  lead  to  a  more  effective  identification  of  candidates  for  specific  antiestrogen 
therapies  and  identify  those  patients  least  likely  to  respond  and  who  may  benefit  from  an  early 
initiation  of  cytotoxic  chemotherapy. 


Statement  of  Work  and  Progress  on  the  Work  Proposed 

The  Specific  Aims  of  this  application  are  being  addressed  in  the  studies  outlined  in  the  Statement 
of  Work. 

TASK  1:  Use  gene  microarrays  to  identify  differentially  expressed  genes  in  a  panel  of  breast 
cancer  cell  lines. 

A.  Expand  cells  and  prepare  RNA  from  cell  lines  for  pilot  study 

B.  Label  RNA  populations,  probe  microarrays  and  digitize  data 

C.  Optimize  probing/reprobing  as  necessary 

D.  Expand  cells  and  prepare  RNA  from  replicate  cultures  of  remaining  cell  lines  (including 
ER-negative  cells)  for  the  baseline  study 

E.  Label  RNA  populations,  probe  microarrays,  and  digitize  data 

F. Expand cells, treat with ICI 182,780 and 4-hydroxyTAM and prepare RNA from
replicate cultures

G. Label RNA populations, probe microarrays, and digitize data

We have effectively completed this aim and have generated all the treated cell populations and
RNAs. We have arrayed most of the populations on the initial Research Genetics arrays and
individually aligned all of the digitized images. We used Pathways version 4.0 and independently
aligned each of the ~4,000 spots/array; this is rather time consuming but provides much higher
quality data than relying on the software alone to align each spot automatically. In year 3, we will
perform limited studies on the same RNAs using our Affymetrix system and compare data across
platforms.

We  have  one  paper  on  our  initial  studies  from  the  Clontech  arrays  (Gu  et  al.,  Cancer  Res  62: 
3428-3437,  2002).  A  paper  on  the  data  from  the  Research  Genetics  platform  is  in  preparation. 


TASK 2: Explore the data from Aim 1 to identify those differentially expressed gene clusters
most closely associated with acquired antiestrogen resistance.

A.  Perform  preliminary  analysis  of  pilot  study  and  identify  candidates  for  further  study 

B.  Generate  reagents  and  confirm  differential  regulation/expression  of  candidates  from  the 
pilot  study 

C. Analyze the data from the baseline study (includes evaluation of ER-negative models
both separately and together with ER-positive cells) using all four data analysis
approaches and identify candidates for further study

D.  Generate  reagents  and  confirm  differential  regulation/expression  of  candidates  from  the 
baseline  study 

E.  Analyze  the  data  from  the  treatment  study  using  all  four  approaches  and  identify 
candidates  for  further  study 

F.  Perform  overall  and  final  analyses,  compare  data  from  each  analytical  method  and 
identify  candidates  for  further  study 

G.  Generate  reagents  and  confirm  differential  regulation/expression  of  candidates  from  the 
treatment  study 

H.  Test  novel  algorithms  for  the  analysis  of  gene  expression  microarray  data 

We have completed "A" and "B" and (in Aim 1) generated most of the data/reagents needed
for "C" and "D".


Our  initial  studies  identified  several  genes  of  interest.  The  paper  by  Gu  et  al.  describes  several  of 
these  for  which  we  have  already  confirmed  differential  expression  and/or  antiestrogenic 
regulation. Other genes in this paper will be evaluated under "D". The data in this paper include
both microarray (Table 1) and serial analysis of gene expression (SAGE; Table 2) data. Since our
study  is  hypothesis  driven  rather  than  technology  driven,  we  will  include  genes  identified  by 
SAGE in our future studies.

Table 1 Representative list of differentially expressed genes identified by gene microarray
analyses

Gene(a)   Unigene #   MCF7/LCC1(b)   MCF7/LCC9   Gene Function
NFkB      Hs.75569    1              2           transcription factor involved in cell survival signaling
SOD       Hs.75428    1              2           enzyme involved in detoxifying oxygen radicals
EGR-1     Hs.326035   3              1           transcription factor
EGFR      Hs.77432    2              1           growth factor receptor
IRF-1     Hs.80645    2              1           transcription factor involved in signaling to cell cycle arrest and apoptosis
TNFa      Hs.241570   2              1           cytokine
TNF-R1    Hs.159      2              1           cytokine receptor involved in signaling to apoptosis

(a) Abbreviations are NFkB, nuclear factor kappa B; SOD, superoxide dismutase; EGR-1, early
growth response gene-1; EGFR, epidermal growth factor receptor; IRF-1, interferon regulatory
factor-1; TNFa, tumor necrosis factor alpha; TNF-R1, tumor necrosis factor-receptor 1.

(b) Data are represented as level of expression relative to the other cell line. Data are based on the
mean values for each gene (6 microarrays of MCF7/LCC1; 5 microarrays of MCF7/LCC9).
Values are expressed to the nearest integer.

Table 2 Differentially expressed genes identified in the MCF7/LCC1 and MCF7/LCC9 SAGE
libraries

Putative Gene(a)                Unigene #   MCF7/LCC1   MCF7/LCC9   Difference(b)   P-value(c)   Gene Function
N-ras related gene              Hs.260523   2           20          10-fold         <0.001       G-protein
Cathepsin D                     Hs.343475   7           34          5-fold          <0.001       protease involved in tumor invasion
X-box binding protein-1         Hs.149923   7           25          4-fold          <0.001       transcription factor
Prefoldin 5                     Hs.288856   6           21          4-fold          0.002        chaperone for unfolded proteins
HSP-27                          Hs.76067    23          55          2-fold          0.001        stress response protein
VitB-12 binding protein         Hs.2012     17          37          2-fold          0.002        vitamin binding protein
Nucleophosmin(d)                Hs.9614     10          14          1.5-fold        >0.05        oncogenic nucleolar protein
L14                             Hs.738      13          2           6-fold          0.021        ribosomal protein
Death associated protein-6      Hs.336916   11          2           6-fold          0.049        apoptosis associated protein
EF-y                            Hs.2186     22          6           4-fold          0.014        translation elongation factor
Ferritin, heavy polypeptide-1   Hs.62954    54          16          3-fold          <0.001       iron binding protein

(a) The gene designations are considered putative, although the identity of most genes designated
in this fashion has been shown to be correct. These genes include those Tags where: (a) the fold
difference is ≥2-fold, (b) the Tag could represent <2 genes, and (c) the Tag represents ≥0.10% of either
the MCF7/LCC1 and/or MCF7/LCC9 SAGE library.

(b) Predicted fold difference in gene expression between MCF7/LCC1 vs. MCF7/LCC9 cells.

(c) Obtained by χ² analyses; p-values estimated to 3 significant figures.

(d) NPM (not statistically significant) is shown because we know it to be both estrogen regulated
and associated with TAM treatment in patients.
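As a worked illustration of the arithmetic behind Table 2, the sketch below computes a fold difference from two SAGE tag counts and a 2x2 Pearson chi-square statistic. The library sizes are not given in this report, so `total_a` and `total_b` are hypothetical inputs, and the choice of an uncorrected Pearson statistic is an assumption; the sketch is not claimed to reproduce the tabulated p-values.

```python
def fold_difference(count_a, count_b):
    """Ratio of the larger tag count to the smaller one."""
    low, high = sorted((count_a, count_b))
    if low == 0:
        raise ValueError("fold difference is undefined for a zero count")
    return high / low


def chi_square_2x2(count_a, total_a, count_b, total_b):
    """Pearson chi-square statistic (1 d.f., no continuity correction).

    total_a and total_b are the total tag counts of the two libraries;
    they are hypothetical inputs because the report does not state them.
    """
    a, b = count_a, total_a - count_a
    c, d = count_b, total_b - count_b
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
```

For example, `fold_difference(2, 20)` gives the 10-fold entry for the N-ras related gene. With a statistic in hand, a value above 3.84 corresponds to p < 0.05 at one degree of freedom.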
We have already trained and tested our initial neural predictors of antiestrogen resistance, using the
data from the Research Genetics platform. Since these studies are not finished, we will present the
mature data in our next annual report. We hope also to have published this predictor before the next
report and to include a preprint in the report. The genes in this predictor, or other differentially
expressed genes associated with the phenotypes, will be the candidate genes for "F" and "G".

In our last report we presented a new normalization algorithm; this study has now been published
(IEEE Trans Inf Technol Biomed, 6: 29-37, 2002). We also have developed a novel "block principal
components analysis" method for exploring gene expression microarray data. A manuscript has been
submitted, and the method and a reprint will be included in the next report. Thus, we continue to
address successfully Task 2H.


TASK  3:  Begin  to  assess  the  likely  functional  relevance  of  representative  members  of  these  clusters 
and  study  their  expression  in  human  breast  cancer  biopsies. 

A. Obtain/generate reagents for the 1-2 candidates from the pilot study

B.  Initiate  pilot  studies  using  transient  transfection  analyses 

C.  Initiate  functional  (transient)  studies  of  candidates  from  baseline  study 

D.  Initiate  functional  (stable  transfection)  studies  of  candidates  from  baseline  study 

E.  Initiate  functional  (transient)  studies  of  candidates  from  treatment  study 

F.  Initiate  functional  (stable  transfection)  studies  of  candidates  from  overall  analysis  (only  if  new 
candidates  are  identified) 

We continue to investigate the functional relevance of those genes/proteins that receive sufficient
priority. We cannot perform detailed functional studies of all our candidates within this application but
have used the present DOD award to obtain preliminary data to support requests for funding to study
specific genes. In this regard, we successfully used the preliminary data generated on XBP-1 to attract
additional DOD funding to perform detailed mechanistic and translational studies of XBP-1. Thus, we
are now able to study XBP-1 in the broader context of its role in endocrine signaling in breast cancer,
but with a focus on its potential contribution to acquired estrogen-independence and antiestrogen
resistance. This successful application also includes retrospective studies to identify the prognostic role
of XBP-1 in breast cancer (outcome independent of therapy) and to assess whether it has
predictive relevance, that is, whether it improves our ability to identify which patients are most likely
to respond to endocrine therapies.

We have completed studies showing that antiestrogen resistant MCF7/LCC9 cells, which exhibit increased
NFkB transactivation (promoter-reporter activity), are more sensitive to the growth inhibitory effects
of Parthenolide, a specific inhibitor of NFkB. Growth inhibition was assessed using a dye-based assay
that effectively estimates cell number. These data are consistent with our hypothesis that increased
NFkB activation in these cells contributes to their ability to survive prolonged antiestrogen exposure.
These data are published (Gu et al., Cancer Res 62: 3428-3437, 2002).

To  maintain  focus  within  this  application,  we  have  limited  our  initial  studies  to  NFkB  and  IRF-1.  Our 
intention is to obtain sufficient preliminary data to support an R01 or DOD application focused on
these two genes and their interactions in antiestrogen resistance. We have continued to study the role
of  our  dominant  negative  interferon  regulatory  factor-1  (dnIRF-1).  Our  studies  progressed  somewhat 
more  slowly  than  expected,  mostly  due  to  the  Fellow  performing  the  studies  taking  maternity  leave. 
However,  we  hope  to  be  back  on  track  within  the  next  few  months  and  complete  the  few  remaining 
experiments  required  to  solidify  and  extend  the  data  presented  in  last  year's  report.  We  will  then  submit 
a  manuscript  on  dnIRF-1  (within  the  next  12  months).  The  mature  data  will  be  included  in  our  next 
annual  report. 


Key  Research  Accomplishments  (bulleted) 

• Completed and published manuscript describing data from gene microarray and SAGE studies
based on the data presented in the previous report. These data show the altered regulation of
X-box binding protein-1, NFkB, NPM and IRF-1 in acquired antiestrogen resistance.

•  Completed  collection  of  RNA  from  resistant  and  parental  cell  cultures. 

•  Obtained  microarray  data  from  resistant  and  parental  cell  cultures. 

•  Completed  microarray  data  preprocessing  and  confirmed  data  quality. 

•  Built  initial  neural  predictors  -  will  be  completed  within  the  next  few  months. 

•  Completed  studies  implicating  NFkB  as  a  mediator  of  survival  from  prolonged  antiestrogen 
exposure. 

•  Completed  and  published  a  new  algorithm  based  on  regression  through  the  origin  for 
normalizing  gene  expression  microarray  data. 

• Completed and published a pilot study showing our ability to generate accurate predictive
neural networks based on gene expression microarray data. The neural network predictors
can accurately identify the phenotype of unknown samples as being cancer or noncancer.


Reportable  Outcomes 

Reportable  outcomes  are  presented  as  manuscripts  and  abstracts. 


Manuscripts  and  Abstracts 

We have published several studies directly related to the funded work.

1. Ellis, M., Davis, N., Coop, A., Liu, M., Schumaker, L., Lee, R.Y., Srikanchana, R., Russell, C.,
Singh, B., Miller, W.R., Stearns, V., Pennanen, M., Tsangaris, T., Gallagher, A., Liu, A., Zwart, A.,
Hayes, D.F., Lippman, M.E., Wang, Y. & Clarke, R. “Development and validation of a method for
using breast core needle biopsies for gene expression microarray analyses.” Clin Cancer Res, 8:
1155-1166, 2002.

2. Wang, Y., Lu, J., Lee, R. & Clarke, R. “Iterative normalization of cDNA microarray data.” IEEE
Trans Inf Technol Biomed, 6: 29-37, 2002.

3. Gu, Z., Lee, R.Y., Skaar, T.C., Bouker, K.B., Welch, J.N., Lu, J., Liu, A., Davis, N., Leonessa, F.,
Brunner, N., Wang, Y. & Clarke, R. “Association of interferon regulatory factor-1, nucleophosmin,
nuclear factor-κB and cAMP response element binding with acquired resistance to Faslodex (ICI
182,780).” Cancer Res, 62: 3428-3437, 2002.

4.  Welch,  J.N.  &  Clarke,  R.  "ErbB-2  expression  and  drug  resistance  in  cancer."  Signal,  in  press 
(review). 

Reprints  of  papers  #1-3  are  included  in  the  appendix. 


Abstracts 

1.  Welch,  J.N.,  Chrysogelos,  S.  &  Clarke,  R.  "Expression  and  function  of  the  epidermal  growth  factor 
receptor  in  breast  cancer  cells  exposed  to  chemotherapy."  Proc  Am  Assoc  Cancer  Res  42:  938, 2001. 

2.  Bouker,  K.B.,  Skaar,  T.C.,  Fernandez,  D.  &  Clarke,  R.  "Antiestrogens  regulate  IRF-1  expression  in 
sensitive  but  not  resistant  breast  cancer  cells."  Proc  Am  Assoc  Cancer  Res  43:  761,  2002. 

3. Zhu, Y., Bouker, K., Skaar, T., Zwart, A., Gomez, B., Hewitt, S., Singh, B., Liu, A. & Clarke, R.
"High throughput tissue microarray assessment of expressions of progression-related genes - NFkB,
nucleophosmin, X-box binding protein-1 and IRF-1 in breast cancer." Proc Am Assoc Cancer Res 43:
762, 2002.

We  also  presented  our  data  at  the  recent  DOD  meeting  in  Orlando,  FL. 


Conclusions 

We have made good progress in our studies on the molecular characterization of antiestrogen
resistance, as is evident in our productivity as measured by publications, new methods, and preliminary
data. The study is on track and the amount of data accumulating is considerable. However, several new
algorithms under development are showing good performance in our very preliminary analyses of
published high dimensional data sets. Our data with NFkB, IRF-1 and the dnIRF-1 are encouraging
and suggest we may be on the right track to identifying new signal transduction pathways associated
with acquired antiestrogen resistance. For example, these data show that resistant cells are more
sensitive to inhibition of NFkB. Overexpression of IRF-1, which is suppressed by estrogens and
induced by antiestrogens, is associated with reduced cell proliferation. The dnIRF-1 provides an
opportunity to further explore some of the mechanistic effects of this gene in acquired antiestrogen
resistance. We also have been successful in using the preliminary data generated in this application to
attract other funding.
Iterative Normalization of cDNA Microarray Data

Yue  Wang,  Jianping  Lu,  Richard  Lee,  Zhiping  Gu,  and  Robert  Clarke 


Abstract — This  paper  describes  a  new  approach  to  normalizing 
microarray expression data. The novel feature is to unify the tasks
of estimating normalization coefficients and identifying the control
gene set. Unification is realized by constructing a window function
over the scatter plot defining the subset of constantly expressed
genes and by effecting optimization using an iterative procedure.
The structure of the window function gates contributions to the
control gene set used to estimate normalization coefficients. This
window  measures  the  consistency  of  the  matched  neighborhoods 
in  the  scatter  plot  and  provides  a  means  of  rejecting  control  gene 
outliers.  The  recovery  of  normalizational  regression  and  control 
gene  selection  are  interleaved  and  are  realized  by  applying  coupled 
operations  to  the  mean  square  error  function.  In  this  way,  the  two 
processes  bootstrap  one  another.  We  evaluate  the  technique  on  real 
microarray  data  from  breast  cancer  cell  lines  and  complement  the 
experiment  with  a  data  cluster  visualization  study. 


Fig. 1. Example of cDNA microarray image.

Index Terms: Data normalization, dynamic programming, gene expression, gene microarray,
linear regression.


I.  Introduction 

Spotted cDNA microarrays are emerging as a powerful
and cost-effective tool for the large-scale analysis of gene
expression. Using this technology, the relative expression levels
in two or more mRNA populations derived from tissue samples
can be assayed for thousands of genes simultaneously [1], [2].
Microarrays are potentially powerful tools for investigating the
mechanism of drug action. Two recent studies have described
the application of high-density microarrays to examine the effects
of drugs on gene expression in yeast as a model system. A
similar method applied to human breast cancer cells and tissues
would have direct utility in the identification and validation of
novel therapeutics. It is widely accepted that the pattern of genes
expressed within a specific cell is essentially responsible for its
phenotype. The most widely publicized use of gene microarrays
has been in cancer research.

From a statistical point of view, sources of measurement error
within an array, and variation between arrays, must be quantified
and taken into account in order to make indirect comparisons
among samples that have not been directly assayed on the
same array. For example, gene microarrays vary with production
batches, introducing variations in the amount of probe
that hybridizes to areas of the support that do not contain target
cDNAs, or the amount of the cDNA spotted onto the support
surface. The specific activity of the probe will vary from probe
to probe, often reflecting variations in the amount of signal produced
by each molecule of label incorporated into the probe.

Two major data preprocessing operations are involved: background
correction and interexperiment normalization. In background
correction, local sampling of background can be used
to specify a threshold that a true signal must exceed. It is even
possible to accurately detect weak signals and extract a mean
intensity above background for the target [3]. A typical cDNA
array image is given in Fig. 1.
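A minimal sketch of threshold-based background correction follows. The rule used here (threshold equals the local background mean plus `k` standard deviations, with `k = 2` by default) and the function name are assumptions chosen for illustration; the text does not specify how the threshold is set.

```python
from statistics import mean, stdev


def background_corrected(spot_pixels, local_background_pixels, k=2.0):
    """Return the mean spot intensity above local background, or 0.0.

    The spot counts as a true signal only if its mean intensity exceeds
    the local background mean by more than k background standard
    deviations (k is a hypothetical tuning parameter).
    """
    bg_mean = float(mean(local_background_pixels))
    bg_sd = float(stdev(local_background_pixels))
    threshold = bg_mean + k * bg_sd
    spot_mean = float(mean(spot_pixels))
    if spot_mean <= threshold:
        return 0.0
    return spot_mean - bg_mean
```

Sampling the background locally, rather than once per array, lets the threshold follow uneven hybridization or support artifacts across the image.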

In carrying out comparisons of expression data using measurements
from a single array or multiple arrays, the question
of normalizing data arises. A reasonable assumption, adopted
by most researchers, is that all experiments are carried out under
conditions of a large excess of immobilized probe relative to labeled
target. The kinetics of hybridization are therefore pseudo-first
order, and interprobe competition is not a factor [3]. Under
these assumptions, the linear differences arising from the exact
amount of applied target, extent of target labeling, efficiencies
of fluor excitation and emission, and detector efficiency can be
compounded into a single variable. Two major strategies can be
used to carry out normalization. One is based on a consideration
of all of the genes in the sample, and the other on a designated
subset expected to be unchanged over most circumstances,
called the control gene set. In instances of closely related samples,
global normalization (i.e., using all genes) will be a useful
tool. As samples become more divergent, a good normalization
may be achieved using a subset of constantly expressed genes
(i.e., using only control genes) [3].
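The two strategies can be contrasted with a small sketch in which the single compounded variable is taken to be a multiplicative scale factor. Estimating it as a ratio of summed signals is an assumption made for illustration, and `scale_factor` and its `indices` argument are hypothetical names.

```python
def scale_factor(floating, reference, indices=None):
    """Ratio of summed reference signal to summed floating signal.

    With indices=None every gene is used (global normalization);
    otherwise only the listed control genes contribute.
    """
    if indices is None:
        indices = range(len(floating))
    indices = list(indices)
    floating_sum = sum(floating[i] for i in indices)
    reference_sum = sum(reference[i] for i in indices)
    if floating_sum == 0:
        raise ValueError("floating signal sums to zero")
    return reference_sum / floating_sum
```

With `floating = [1.0, 2.0, 3.0, 100.0]` and `reference = [2.0, 4.0, 6.0, 50.0]`, restricting the estimate to the first three genes recovers a factor of 2, whereas the global estimate is pulled below 1 by the single strongly changed gene. This is the situation in which a control gene set is preferable.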

The work most closely related to our methodology was reported
in [4]. The authors introduced a comparison of gene expression
levels arising from cohybridized samples by taking ratios
of average expression levels for individual genes. A novel
method of image segmentation was presented to identify cDNA
target sites, and a hypothesis test and confidence interval were
developed to quantify the significance of observed differences
in expression ratios. In particular, the probability density of the
ratio and the maximum-likelihood estimator for the distribution
were derived, and an iterative procedure for signal calibration
was developed. In general, however, an integral of ratios is not
the same as a ratio of integrals, and simple ratios of the data will
not necessarily provide unbiased estimates of expression ratios.
Alternatively, the mean value of all signals on the hybridized
filter can be used for normalization, and further normalizations
can be done to a reference hybridization [5]. Nonetheless, the
optimal approach remains controversial.
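The discrete analogue of this point is that a mean of ratios generally differs from a ratio of means. The short sketch below makes the difference concrete; the function names are illustrative only.

```python
def mean_of_ratios(numerators, denominators):
    """Average the per-item ratios."""
    ratios = [n / d for n, d in zip(numerators, denominators)]
    return sum(ratios) / len(ratios)


def ratio_of_means(numerators, denominators):
    """Divide the mean numerator by the mean denominator."""
    return sum(numerators) / sum(denominators)
```

For numerators `[2.0, 2.0, 2.0]` and denominators `[1.0, 2.0, 4.0]`, the mean of ratios is 3.5/3 while the ratio of means is 6/7, so the two summaries disagree even though they describe the same three measurements.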

II.  Method  and  Algorithm 

In this paper, we adopt a somewhat different approach to
the problem of normalizing microarray expression data. Rather
than rejecting those control genes that give rise to a large normalization
error, we attempt to iteratively correct them. In a
nutshell, our idea is to bootstrap by alternating between estimating
normalization coefficients and identifying the control gene
subset. The framework is furnished by constructing a window
function over the scatter plot defining the subset of constantly
expressed genes. Specifically, this window measures the consistency
of the matched neighborhoods in the scatter plot and
provides a means of rejecting control gene outliers. We evaluate
the technique on real microarray data from breast cancer
cell lines and complement the experiment with a data cluster
visualization study.
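The alternation described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the line is fitted by ordinary least squares, and the control gene window is replaced by a hypothetical residual rule (a gene is kept as a control if its reference value lies within `tolerance` of the fitted line). The names `fit_line`, `iterative_normalize`, and `tolerance` are introduced here for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct x values")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def iterative_normalize(floating, reference, initial_controls, tolerance,
                        max_iter=50):
    """Alternate between fitting the line and reselecting control genes.

    The reselection rule |y - (a*x + b)| <= tolerance is an assumed
    stand-in for the paper's window function.
    """
    controls = sorted(set(initial_controls))
    for _ in range(max_iter):
        a, b = fit_line([floating[i] for i in controls],
                        [reference[i] for i in controls])
        updated = [i for i in range(len(floating))
                   if abs(reference[i] - (a * floating[i] + b)) <= tolerance]
        if len(updated) < 2 or updated == controls:
            break  # control set is stable (or too small to refit)
        controls = updated
    # Final fit on the settled control set, then map every floating value.
    a, b = fit_line([floating[i] for i in controls],
                    [reference[i] for i in controls])
    normalized = [a * x + b for x in floating]
    return a, b, controls, normalized
```

Starting from a few housekeeping genes, each pass enlarges or prunes the control set using the current fit, and the next fit uses the revised set, so the two estimates refine one another until the control set stops changing.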

Our  goal  is  to  generate  a  transformation  that  best  maps  the 
expression  levels  of  floating  data  set  onto  their  counterparts  in  a 
reference  data  set.  Assume  that  data  points  {xx ,  xo,  . . . ,  xnc } 
and  {yi,  2/2,  •  • . ,  Vnc  }  are  the  expression  levels  of  the  control 
or  housekeeping  genes  from  two  microarray  experiments,  where 
nc  is  the  total  number  of  control  genes.  In  this  paper,  we  use 
{a:*}  as  the  floating  data  set  and  {2/7}  as  the  reference  data 
set.  We  further  assume  that  the  normalization  can  be  accurately 
achieved  through  a  linear  regression  mapping 

yi-axi  +  b  (1) 


setting  them  to  zero.  It  can  be  shown  that  the  estimated  linear 
regression  coefficients  a  and  b  can  be  calculated  by  [6] 
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where  fj,x  and  ny  are  the  means  of  {a:i}  and  {yi},  respectively, 
given  by 
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and  the  normalization  shall  be  performed  for  all  of  the  data 
points  in  the  floating  data  set  based  on  (1). 

The  accuracy  of  the  method  highly  depends  on  the  selection 
of  control  genes.  In  addition  to  the  predetermined  control 
genes,  including  housekeeping  genes,  we  shall  add  more 
control  genes  based  on  a  reasonable  heuristics  that  the  genes 
that  are  nondifferentially  expressed  should  be  considered  as 
control  genes  in  normalization.  Posed  in  this  way,  there  is  a 
basic  “chicken-and-egg”  problem  [7].  Before  a  good  control 
gene  subset  can  be  defined,  expression  levels  of  all  genes  need 
to  be  reasonably  normalized.  Yet,  this  normalization  is,  after 
all,  the  ultimate  goal  of  computation. 

We  propose  an  iterative  regression  normalization  algorithm  to 
solve  this  problem.  First,  solely  based  on  the  predetermined  con¬ 
trol  genes  such  as  housekeeping  genes,  we  will  conduct  an  initial 
normalization  to  all  data  sets  based  on  (1M5).  Since  an  accu¬ 
rate  data  analysis  requires  several  repetitive  cDNA  hybridiza¬ 
tions  in  microarray  studies  [8],  starting  from  the  whole  data  set, 
we  will  then  eliminate  those  genes  from  the  control  gene  list 
whose  expressions  have  a  large  standard  deviation  across  repli¬ 
cations,  namely,  outliers,  according  to  the  criterion  given  by 
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where  a  is  the  true  ratio  of  the  data  and  b  is  the  bias  correction  of 
the data. Since there are two free parameters in the transformation,
the estimation of their values requires a minimum of two
data points that are known to be in correspondence. Considering
the effect of noise, however, more control points are needed to
produce an accurate estimate. This process is overconstrained
and can be solved using least squares estimation. Clearly, a
natural criterion is the minimum mean squared error between the
two control data subsets. Based on the expression levels of the
control genes, the mean squared error (MSE) can be written as

$$\mathrm{MSE} = \frac{1}{n_c}\sum_{i=1}^{n_c}\left[y_i - (a x_i + b)\right]^2. \qquad (2)$$
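Setting the partial derivatives of (2) with respect to $a$ and $b$ to zero gives the usual closed-form least-squares estimates. The sketch below illustrates that standard result in plain Python; it is not the authors' implementation, and the function names `fit_linear` and `mse` as well as the toy intensities are assumptions introduced only for illustration.

```python
def fit_linear(x, y):
    """Least-squares (a, b) minimizing the mean of [y_i - (a*x_i + b)]^2."""
    n = len(x)
    if n < 2 or n != len(y):
        raise ValueError("need at least two paired control points")
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    sxx = sum((xi - mu_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    sxy = sum((xi - mu_x) * (yi - mu_y) for xi, yi in zip(x, y))
    a = sxy / sxx            # scaling factor
    b = mu_y - a * mu_x      # shifting offset
    return a, b


def mse(x, y, a, b):
    """Mean squared error of the fit, as in (2)."""
    return sum((yi - (a * xi + b)) ** 2 for xi, yi in zip(x, y)) / len(x)


# Hypothetical control-gene intensities: floating set x, reference set y.
x = [1.0, 2.0, 3.0, 4.0]
y = [5.0, 7.0, 9.0, 11.0]   # constructed so that y = 2x + 3 exactly
a, b = fit_linear(x, y)
print(a, b, mse(x, y, a, b))
```

With noisy data the fit no longer reproduces the reference values exactly, which is why more than the minimum of two control points is needed.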
for all genes, where $m_i$ is the number of replications for gene
$i$ in the experiment, $x_{ij}$ is the expression level of gene $i$ in
the $j$th replication, $\mu_i$ is the mean of the replications, and
$\epsilon_1$ is a predetermined threshold.

In our experiment, $\epsilon_1$ is determined as follows. For each gene,
the replications are normalized by their mean and the normalized
standard deviation is calculated. A mean standard deviation is
then obtained as the sample average of the individual normalized
standard deviations. Our experience has shown that setting
$\epsilon_1$ to twice the mean standard deviation is appropriate
and effective. It should be noted that this criterion will also
eliminate differentially expressed genes from the control gene list.
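A minimal sketch of this threshold rule is given here, assuming that the normalized standard deviation is the population standard deviation of the replicates divided by their mean; the exact form of the criterion is not recoverable from the text, so this choice, the function names, and the toy data are all assumptions.

```python
from statistics import mean, pstdev


def normalized_std(replicates):
    """Standard deviation of the replicates after division by their mean."""
    mu = mean(replicates)
    if mu == 0:
        raise ValueError("cannot normalize replicates with zero mean")
    return pstdev(replicates) / abs(mu)   # assumed population form


def stable_genes(expr, factor=2.0):
    """expr maps gene name -> list of replicate intensities.

    Returns (threshold, sorted list of genes kept as stable)."""
    spread = {gene: normalized_std(reps) for gene, reps in expr.items()}
    threshold = factor * mean(spread.values())   # twice the mean normalized std
    kept = sorted(g for g, s in spread.items() if s <= threshold)
    return threshold, kept


# Hypothetical replicate intensities for four genes.
expr = {
    "g1": [100, 100, 100],
    "g2": [90, 100, 110],
    "g3": [10, 100, 190],    # highly variable across replications
    "g4": [95, 100, 105],
}
eps1, kept = stable_genes(expr)
print(round(eps1, 3), kept)
```

In this toy example only the highly variable gene exceeds the threshold and is dropped from the candidate list.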
Thus,  a  gene  will  be  selected  as  a  control  gene  if  its  expression 
level  pair  across  reference  and  floating  experiments  satisfies 


Thus, the search principle for estimating the optimal values of
$a$ and $b$ is simply taking the partial derivatives of the MSE and

Fig. 2. Example of control gene selection window. (a) Before and (b) after normalization.


where $\epsilon_2$, $\epsilon_3$, and $\epsilon_4$ are the empirically predetermined thresholds
defining the subset of constantly expressed genes. It can be seen
that (7) defines a window function over the scatter plot. A typical
window function is illustrated in Fig. 2. In particular, in our
experience, ratios can be very unstable when one (or both) of the
signals is small or large. Thus, we further eliminate unstably
expressed genes from the control gene list using the constraints
defined by $\epsilon_2$ and $\epsilon_3$. Clearly, $\epsilon_4$
provides the boundaries of a constantly expressed gene subset.

The algorithm first generates the interim scatter plot of the
data sets through the observations and the current parameter
estimates [(1)] and then updates the parameter estimates using a
newly defined control gene subset [(3)-(5)]. The procedure
cycles back and forth between these two steps until it reaches a
stationary point where no significant change occurs to the
content of the control gene subset. A summary of the major steps is
given as follows.

1) Based on predetermined control genes including housekeeping
genes, estimate initial values of $a^{(0)}$ and $b^{(0)}$ and
perform an initial normalization using (3)-(5) and (1),
where only one data set is used as a reference set and all
other data sets are considered as floating sets and are
normalized to the reference set.

2) Eliminate from the control gene list those genes whose
expression has a large standard deviation across replications,
according to the criterion given by (6).

3) For each experiment pair, construct a new control gene
subset by selecting additional control genes that satisfy
(7).

4) Based on the newly constructed control gene subset,
estimate interim values of $a^{(m)}$ and $b^{(m)}$ and perform
data normalization for each of the floating data sets using
(3)-(5) and (1), where $m$ is the iteration index.

5) Repeat Steps 3) and 4) until convergence ($a \to 1$
and $b \to 0$) is reached or no significant change occurs
to the content of the control gene subset.
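The steps above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the window test in `in_window` assumes a polar form (radius between two bounds, angle within a sector around the diagonal) suggested by the later description of $r$ and the sector angle $\phi$; Step 2 (the replicate filter) is omitted; and all names, toy intensities, and the shrinking schedule `phis` are hypothetical.

```python
import math


def fit_linear(x, y):
    """Ordinary least-squares slope a and intercept b for y ~ a*x + b."""
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mu_x) ** 2 for xi in x)
    sxy = sum((xi - mu_x) * (yi - mu_y) for xi, yi in zip(x, y))
    a = sxy / sxx
    return a, mu_y - a * mu_x


def in_window(x_norm, y_ref, r_min, r_max, phi):
    """ASSUMED window: radius in (r_min, r_max), angle within phi/2 of the diagonal."""
    r = math.hypot(x_norm, y_ref)
    theta = math.atan2(y_ref, x_norm)
    return r_min < r < r_max and abs(theta - math.pi / 4) <= phi / 2


def iterative_normalize(x, y, init_controls, r_min, r_max, phis, tol=1e-6):
    """Normalize floating data x to reference y; returns (normalized x, control indices)."""
    controls = sorted(init_controls)
    # Step 1: initial normalization from the predetermined control genes.
    a, b = fit_linear([x[i] for i in controls], [y[i] for i in controls])
    x_norm = [a * v + b for v in x]
    for phi in phis:                      # the window shrinks at every iteration
        # Step 3: rebuild the control subset from the current scatter plot.
        selected = [i for i in range(len(x_norm))
                    if in_window(x_norm[i], y[i], r_min, r_max, phi)]
        if len(selected) < 2:
            break
        # Step 4: re-estimate the coefficients and renormalize the floating data.
        a, b = fit_linear([x_norm[i] for i in selected], [y[i] for i in selected])
        x_norm = [a * v + b for v in x_norm]
        unchanged = selected == controls
        controls = selected
        # Step 5: stop when a -> 1 and b -> 0, or the subset no longer changes.
        if unchanged or (abs(a - 1) < tol and abs(b) < tol):
            break
    return x_norm, controls


# Hypothetical data: y = 2x + 10 except for genes 2 and 5, which are
# differentially expressed; gene 2 is also an unreliable "housekeeping" gene.
x = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0]
y = [30.0, 50.0, 300.0, 90.0, 110.0, 20.0, 150.0, 170.0]
x_norm, controls = iterative_normalize(
    x, y, init_controls=[0, 2, 7], r_min=1.0, r_max=1000.0,
    phis=[math.pi / 2, math.pi / 4, math.pi / 8])
print(controls)
```

In this toy run the two differentially expressed genes drop out of the control subset as the window narrows, and the remaining genes end up on the diagonal.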

The philosophy for estimating normalization coefficients and
identifying a control gene set is similar in spirit to the
self-organization principle [9], [10]. The structure of the window
function gates contributions to the control gene subset used to
estimate normalization coefficients, so that possible oscillation
during algorithm convergence can be prevented. Specifically, the
window function defines a neighborhood of the scatter centroid
that gates the contribution of the control gene subset
to normalization. By making the value $\epsilon_4$ of the topological
window function decrease with time, the neighborhood is
initially very large and shrinks slowly to its final desired size
(e.g., a nearest neighbor structure). A popular choice for the
dependence of $\epsilon_4$ on discrete time $m$ is exponential decay
[9]. In addition, the actual algorithm implementation must address
numerical stability. We have applied a simple
dynamic programming technique to estimate the normalization
coefficients, called a factoring-shifting (FS) procedure.

F-Step

$$a^{(k|m)} = \frac{\sum_{i=1}^{n_c} x_i^{(k|m)} y_i}{\sum_{i=1}^{n_c} \left[x_i^{(k|m)}\right]^2} \qquad (8)$$

$$x^{(k'|m)} = a^{(k|m)} x^{(k|m)} \qquad (9)$$

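S-Step

$$b^{(k|m)} = \frac{1}{n_c}\sum_{i=1}^{n_c}\left[y_i - x_i^{(k'|m)}\right] \qquad (10)$$

$$x^{(k+1|m)} = x^{(k'|m)} + b^{(k|m)} \qquad (11)$$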

where at each complete cycle of the procedure, we first use the
“old” set of floating data to determine the normalization factor
$a^{(k|m)}$ using (8) by setting $b = 0$ and simply rescale the floating
data values using (9). These interim results $x^{(k'|m)}$ are then used
to obtain the normalization shift $b^{(k|m)}$ using (10) by setting
$a = 1$ and further generate “new” values of floating
data using (11). The procedure cycles back and forth until the
values of $a^{(k|m)}$ and $b^{(k|m)}$ reach their stationary points.
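A minimal sketch of one way to realize the FS cycle is shown below. It assumes that the F-step is a least-squares slope through the origin and that the S-step is the least-squares shift with unit slope (the mean residual); the function name, tolerance, and toy data are assumptions made for illustration.

```python
def fs_normalize(x, y, max_cycles=10000, tol=1e-10):
    """Alternate factoring (scale) and shifting (offset) steps on floating data x.

    Returns (normalized x, number of cycles used)."""
    xk = list(x)
    n = len(xk)
    cycles = 0
    for cycles in range(1, max_cycles + 1):
        # F-step: slope through the origin (b = 0), then rescale.
        a = sum(xi * yi for xi, yi in zip(xk, y)) / sum(xi * xi for xi in xk)
        xk = [a * xi for xi in xk]
        # S-step: shift with a = 1 (mean residual), then translate.
        b = sum(yi - xi for xi, yi in zip(xk, y)) / n
        xk = [xi + b for xi in xk]
        # Stationary point: the scale stays at 1 and the shift stays at 0.
        if abs(a - 1) < tol and abs(b) < tol:
            break
    return xk, cycles


# Hypothetical control-gene intensities with y = 2x + 3 exactly.
x = [1.0, 2.0, 3.0, 4.0]
y = [5.0, 7.0, 9.0, 11.0]
x_fs, n_cycles = fs_normalize(x, y)
print(n_cycles, [round(v, 6) for v in x_fs])
```

Each step can only lower the MSE, and the stationary point satisfies the same normal equations as the joint least-squares fit. In this sketch convergence tends to be slow when the mean intensity is large relative to its spread, so the cycle limit is set generously.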

In relation to previous work, the concept of using linear
regression analysis for microarray normalization can be traced
back to [4] and was further developed in [12] for iterative
regression in conjunction with control gene selection. Such
approaches are based on several assumptions regarding the data
and can be considered as special cases of our framework [5].

The primary assumption is that, for either the entire collection
of arrayed genes or some subset such as housekeeping genes, the
shift of the measured expression averaged over the set is zero
(e.g., $b = 0$) and the ratio of the normalized expression pair
averaged over the set should be one [e.g., $(1/n_c)\sum_{i=1}^{n_c} (y_i / a x_i) = 1$].

Under these assumptions, there are basically three major
approaches for calculating the normalization factor $a$ [5]. The first
simply uses the mean value of all the background-corrected
signals. Normalization can be performed separately for each of
the data sets, without explicitly calculating $a$ and selecting a
reference data set [15]. Specifically, if a raw data pair is
denoted by $(\{x_i\}, \{y_i\})$, normalization leads to
$(\{x_i / \sum_{i=1}^{n_c} x_i\}, \{y_i / \sum_{i=1}^{n_c} y_i\})$.
By multiplying the pair by $\sum_{i=1}^{n_c} y_i$, the result is
equivalent to using (4), that is, $(\{a x_i\}, \{y_i\})$, where
$a = \sum_{i=1}^{n_c} y_i / \sum_{i=1}^{n_c} x_i$ and $b = 0$. A second approach uses
simplified linear regression analysis, called linear regression
through the origin [6]. Consequently, a scatter plot of the
normalized data set pair should have a slope of one [5]. By
setting $b = 0$ in (1) and (2), the normalization factor is given
by $a = \sum_{i=1}^{n_c} x_i y_i / \sum_{i=1}^{n_c} x_i^2$. A third approach relies on the
assumption that, for control genes, the distribution of expression
levels can be modeled and the mean of the ratio adjusted
to one [4]. An iterative procedure was developed to estimate $a$
by $(1/n_c)\sum_{i=1}^{n_c} (y_i / x_i)$, once again setting $b = 0$. It should be
noted that some heuristic approximations have been made in
using these approaches, since, in general

$$\frac{\sum_{i=1}^{n_c} y_i}{\sum_{i=1}^{n_c} x_i} \neq \frac{\sum_{i=1}^{n_c} x_i y_i}{\sum_{i=1}^{n_c} x_i^2} \qquad (12)$$

and

$$b \neq 0 \qquad (13)$$

while our method is presented as a standard linear regression
analysis without any approximation step.
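A small numerical illustration of why the three estimates of $a$ generally differ is sketched below; the function names and the toy values are assumptions made for illustration and are not taken from the experiments reported later.

```python
def scale_by_totals(x, y):
    """First approach: ratio of the summed (equivalently, mean) signals."""
    return sum(y) / sum(x)


def scale_through_origin(x, y):
    """Second approach: least-squares regression through the origin."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)


def scale_mean_ratio(x, y):
    """Third approach: mean of the per-gene ratios y_i / x_i."""
    return sum(yi / xi for xi, yi in zip(x, y)) / len(x)


def mse(x, y, a, b=0.0):
    """Mean squared error of y against a*x + b."""
    return sum((yi - (a * xi + b)) ** 2 for xi, yi in zip(x, y)) / len(x)


# Hypothetical intensities for three control genes.
x = [1.0, 2.0, 4.0]
y = [3.0, 3.0, 6.0]
a1 = scale_by_totals(x, y)
a2 = scale_through_origin(x, y)
a3 = scale_mean_ratio(x, y)
for name, a in (("totals", a1), ("origin", a2), ("mean ratio", a3)):
    print(name, round(a, 4), round(mse(x, y, a), 4))
```

Among estimates that fix $b = 0$, the regression through the origin gives the smallest MSE by construction, so the other two can only match or exceed it.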

III.  Experiment  and  Discussion 

In this section, we provide an experimental evaluation of our
new normalization method. This investigation has two related
strands. First, we furnish examples demonstrating the use
of an iterative normalization scheme on real microarray data.
Here, we use two different data sets. The first of these
involves within-class normalization of data from LCC1 breast
cancer cell lines across replications. The second example
involves normalizing between-class breast cancer cell line data
from LCC1 against LCC9, whose phenotypes are known to be
different from LCC1.


In  the  second  strand  of  our  experiments,  we  will  provide  an 
algorithm  accuracy  analysis.  Here,  we  confine  ourselves  to  the 
linear  regression  variant  of  the  normalization  process.  The  aim 
is  to  experimentally  compare  our  iterative  algorithm  with  the 
performance  of  each  of  its  components  taken  individually,  thus 
to  demonstrate  that  the  combined  processing  of  both  control 
gene  selection  and  transformation  coefficient  estimation  yields 
significant  advantages  over  existing  methods.  In  addition,  we 
would  like  to  acknowledge  that  although  the  cell  lines  are  not 
fully  representative  of  solid  tumors  in  humans,  their  patterns  of 
gene  expression  profile  are  rich  in  information  with  respect  to 
drug  resistance. 

We obtained gene expression profiles from two breast
cancer cell lines. MCF7/LCC1 is an estrogen-independent but
antiestrogen-responsive variant of the MCF-7 human breast
cancer cell line [14], [15]. An antiestrogen-resistant variant
(MCF7/LCC9) was obtained by stepwise selection of MCF7/LCC1
cells against the steroidal antiestrogen ICI 182,780
(trade name: Faslodex). MCF7/LCC9 cells have many of the
characteristics seen in antiestrogen-resistant human breast
cancers and provide a novel model in which to study antiestrogen
resistance [14].

Gene expression profiles were obtained using the Atlas™
Human Array cDNA expression microarrays (Clontech
Laboratories, Inc., Palo Alto, CA). These microarrays are
produced on nylon filters and contain 588 target genes and
nine housekeeping genes. Briefly, total RNA was obtained
from independent cultures of MCF7/LCC1 and MCF7/LCC9
cells with the TRIzol reagent (Life Technologies, Grand Island,
NY). One μg of DNase-treated mRNA was primed with Clontech’s
cDNA Synthesis Primer mix and the product reverse
transcribed into radiolabeled cDNA with [α-32P]dATP (Amersham
Life Science Inc., Arlington Heights, IL). Probes were
purified and denatured, and both C0t-1 DNA and 1 M NaH2PO4
(pH 7.0) were added to the denatured probe. Each microarray
was prehybridized with 5 ml of ExpressHyb buffer and 0.5 mg of
denatured DNA from sheared salmon testes. Microarray filters
were hybridized overnight with the appropriate [α-32P]-labeled
cDNA probe. The array was extensively washed and sealed in
plastic, with signals detected by phosphorimage analysis using
a Molecular Dynamics Storm phosphorimager (Molecular
Dynamics, Sunnyvale, CA). Digitization of these signals provided
numerical values representing the signal for each gene.

Generally, it has been assumed that, under variable conditions,
the expression of housekeeping genes remains unchanged.
Hence, high-throughput differential expression data can rely on
these genes for data normalization. However, recent data
indicate deviation from this concept [11].

To assess the effectiveness of housekeeping genes in
normalizing cDNA microarray data, a normalization based on
single linear regression was performed using only the set of nine
housekeeping genes suggested by Clontech. The scatter
plots of the normalization results are given in Fig. 3. Although
log-log scatter plots are widely used, we have decided to
use scatter plots on the original scale, since our numerical
simulations have shown that log-log plots can give misleading
perceptions through the “distorted” shape of the actual data
distribution. Particularly focusing on breast cancer, we have
observed significant variations in the expression of these
housekeeping genes. For example, differential expression was
observed between LCC1 and LCC9 (see Fig. 4, where the data sets
were normalized using the first method discussed above). This was
observed in all of our experiments and is consistent with the
observation reported in [11]. Therefore, selection and use of
housekeeping genes for normalization of differential expression
data from various biological models should be approached with
caution [11].

Fig. 3. (a) Scatter plot of within-class normalized microarray data based on nine housekeeping genes. (b) Scatter plot of between-class normalized microarray
data based on nine housekeeping genes. (Circle: before normalization; dot: after normalization.)

Fig. 4. Example of differential expression of housekeeping genes (red dots; the dashed lines are the partition edges of the window functions). (a) Within-class
and (b) between-class.

Since evaluation requires comparison with existing methods,
we have implemented all three major approaches and applied
them to the same data sets. In this experiment, all genes are
considered as control genes and used in the calculation. Our
measure of normalization accuracy is the MSE defined by (2)
over the selected control gene set. The result of using the first
method is given in Fig. 5, where
$a = \sum_{i=1}^{n_c} y_i / \sum_{i=1}^{n_c} x_i = 9.9$
and $b = 0$; an MSE of 8549 is reached. In the second method,
normalization is based on a linear regression through the origin,
i.e., $a = \sum_{i=1}^{n_c} x_i y_i / \sum_{i=1}^{n_c} x_i^2$, which is closest to the
correct formulation. The corresponding result is shown in Fig. 6,
where $a = 5.0$ and $b = 0$. A lower MSE of 3905 is
obtained, consistent with our theoretical expectation. In Fig. 7,
we show the normalization result using the third method, i.e.,
$a = (1/n_c)\sum_{i=1}^{n_c} (y_i / x_i)$. As predicted, a biased estimate of the
expression ratio is obtained, leading to a high MSE of 20 728
with $a = 18$.

These comparisons clearly indicate that the three existing
approaches are not equivalent, as shown by both our experimental
results and the theoretical justification of (12). To illustrate the
impact on normalization accuracy of using the whole gene set as
the control gene set and of using a dynamic programming technique,
we applied method 2 to the differential expression between
LCC1 and LCC9. The scatter plot is given in Fig. 8. The
corresponding MSE in this case is 6527, compared to the previous
MSE of 3905. An increase in MSE suggests that, as samples
become more divergent, a good normalization may be achieved
using a subset of constantly expressed genes rather than a global
normalization (e.g., using all genes) [3]. We then used the FS
procedure to estimate both $a$ and $b$. This additional step further
reduced the MSE to 6438, but this reduction is probably not
significant.

Fig. 5. Scatter plot of normalized microarray data using the existing method 1.
(Circle: before normalization; dot: after normalization.)

Fig. 6. Scatter plot of normalized microarray data using the existing method 2.
(Circle: before normalization; dot: after normalization.)

Fig. 7. Scatter plot of normalized microarray data using the existing method 3.
(Circle: before normalization; dot: after normalization.)

Fig. 8. Scatter plot of normalized between-class microarray data using the
existing method 2. (Circle: before normalization; dot: after normalization.)

Fig. 9. Scatter plot of normalized within-class microarray data based on
selected control gene subset using a static window function. (Circle: before
normalization; dot: after normalization.)

To explore the effect of control gene selection, we first
performed an initial linear regression using the whole gene set. Four
different window functions were configured to select control
gene subsets, where $\epsilon_2 < r < \epsilon_3$ and $\phi$ is the sector angle of the
window function. Based on the selected control genes, we then
applied a single linear regression to normalize within-class
samples. A numerical comparison of the normalization accuracy
obtained with different control gene subsets is reported in
Table I. The main feature to note from these results
is that, for different window functions, a stable estimate of the
scaling factor $a$ is obtained, while the shifting offset $b$ varies
significantly from case to case. In addition, the MSEs of
normalizations in all three cases are comparable (i.e., 5632-5796). The
scatter plot of the best normalization result is shown in Fig. 9.

We further applied the same procedure to between-class
samples and observed similar data characteristics.
The scaling factor in this case is about $a = 44$, while $b$ varies
substantially. Not surprisingly, an increase in MSE is observed
(i.e., 6754-7384). Numerical analysis with different window
functions demonstrates the capability of the approach, since
the interim estimate of the linear regression coefficient is very
stable with a satisfactorily low MSE. Indeed, the robustness of
the gene selection step has been observed in
all our experiments. A typical scatter plot of between-class
normalization results, using a window-based control gene
selection, is given in Fig. 10.

TABLE I

Numerical Comparisons of Normalization Results Based on a Designated Subset of Control Genes With Different Window Configurations

Fig. 10. Scatter plot of normalized between-class microarray data based on
selected control gene subset using a static window function. (Circle: before
normalization; dot: after normalization.)

Next, we provide an illustration of the iterative properties
of our normalization algorithm. The sequence in our experiment
shows the iterative recovery of the full linear regression
matching. In this within-class case, $10000 < r < 24000$
and $\phi = \pi/2, \pi/4, \pi/8, \pi/16, \pi/32, \pi/48$. Each window
shrinking step is mixed with one of the FS steps using the
current set of recovered data points. The initial parameters are
estimated based on the whole gene set. The normalization process
converges to a good solution after six iterations. Figs. 11 and 12
show the scatter plots of initial and final normalization results.
Once the algorithm has converged, the consistency of the control
gene selection is significantly improved. Moreover, there are no
erroneous matches between control genes for the last two
adjacent iterations. The final control gene subset contains 37 genes.
Finally, the MSE of 3892 is in good agreement with the
corresponding results of the existing methods.

We next considered the iterative normalization for between-class
samples. As a step toward improving the performance of
microarray data normalization, we have put considerable effort
into conducting various studies and developing reliable control
gene selection and linear regression techniques. More precisely,
we aim to perform an unsupervised normalization when
confronted with unreliable housekeeping genes. Experience
suggested that our newly proposed method can achieve this goal.
We applied our algorithm to the differential expression between
LCC1 and LCC9. In this between-class case, $10000 < r < 24000$
and $\phi = \pi/4, \pi/8, \pi/16, \pi/48$. As before, the initial
parameters are estimated based on the whole gene set. The
normalization process converges on a good solution after only four
iterations. Figs. 13 and 14 show the scatter plots of initial and
final normalization results. The final control gene subset
contains 43 genes, and a stable and satisfactory MSE of 6523 is
reached.

Fig. 11. Scatter plot of initial normalized within-class microarray data based
on selected control gene subset using a dynamic window function. (Circle:
before normalization; dot: after normalization.)

Fig. 12. Scatter plot of final normalized within-class microarray data based on
selected control gene subset using a dynamic window function. (Circle: before
normalization; dot: after normalization.)

Finally, we used our previously developed VISDA
algorithm to display the expression patterns of different cell
line samples in the gene expression space [13]. All data were
normalized using the new method. For a molecular analysis
of breast cancer, the profile of microarray expression is the
molecular signature of interest. Each sample is represented
as a point in a d-dimensional gene expression
space in which each axis represents the expression level
of one gene. The presence of well-separated sample groups
implies that the representations of samples within the same
group are close to each other in this gene expression space but
distant from those of other samples. Thus, the representations
of phenotype-specific samples form clusters. Fig. 15 shows a
projected display of 597-gene dimensions into the top three
principal discriminative component spaces, based on Fisher’s
scatter matrix [9]. With an accurate data normalization, visual
exploration reveals three phenotype-specific sample clusters
in gene expression space. Using the trace of Fisher’s scatter
matrix as a measure of the separability of patterns, our new
normalization method achieved an improved performance with
respect to the existing methods.

One important consideration with the present approach is the
measure of quality in data normalization [11]. This is not a
glamorous area, but progress in it is critical for the future success of
data normalization [12]. What is the correct control gene set for
a direct normalization of between-class data sets? How effective
was a particular normalization method? Did the succeeding
analysis come to the correct conclusion? Benchmark criteria in
data normalization vary widely and are difficult to assign
[5]. We believe that in data normalization, there is currently no
objective measure of quality, and so it is difficult to quantify
the merit of a particular data normalization technique. The
effectiveness of such a technique is often highly data-dependent.
However, we would expect this iterative normalization method
to be an effective tool in many gene microarray applications.

Fig. 13. Scatter plot of initial normalized between-class microarray data
based on selected control gene subset using a dynamic window function.
(Circle: before normalization; dot: after normalization.)

Fig. 14. Scatter plot of final normalized between-class microarray data based
on selected control gene subset using a dynamic window function. (Circle:
before normalization; dot: after normalization.)

Fig. 15. Projection of 597-gene dimensions onto top three principal
discriminant component spaces based on Fisher’s scatter matrix measure of the
separability of patterns. With an accurate data normalization, visual exploration
reveals phenotype-specific sample clusters in gene expression space.
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ABSTRACT 

To identify genes associated with survival from antiestrogens, both
serial analysis of gene expression and gene expression microarrays were
used to explore the transcriptomes of antiestrogen-responsive
(MCF7/LCC1) and -resistant variants (MCF7/LCC9) of the MCF-7 human breast
cancer cell line. Structure of the gene microarray expression data was
visualized at the top level using a novel algorithm that derives the first
three principal components, fitted to the antiestrogen-resistant and
-responsive gene expression data, from Fisher’s information matrix. The
differential regulation of several candidate genes was confirmed.
Functional studies of the basal expression and endocrine regulation of
transcriptional activation of implicated transcription factors were studied
using promoter-reporter assays.

The putative tumor suppressor interferon regulatory factor-1 is
down-regulated in resistant cells, whereas its nucleolar phosphoprotein inhibitor
nucleophosmin is up-regulated. Resistant cells also up-regulate the
transcriptional activation of cyclic AMP response element (CRE) binding and
nuclear factor κB (NFκB) while down-regulating epidermal growth factor
receptor protein expression. Inhibition of NFκB activity by ICI 182,780 is
lost in resistant cells, but CRE activity is not regulated by ICI 182,780 in
either responsive or resistant cells. Parthenolide, a potent and specific
inhibitor of NFκB, inhibits the anchorage-dependent proliferation of
antiestrogen-resistant but not antiestrogen-responsive cells. This observation
implies a greater reliance on their increased NFκB signaling for
proliferation in cells that have survived prolonged exposure to ICI 182,780.

These data from serial analysis of gene expression and gene microarray
studies implicate changes in a novel signaling pathway, involving
interferon regulatory factor-1, nucleophosmin, NFκB, and CRE binding in cell
survival after antiestrogen exposure. Cells can up-regulate some
estrogen-responsive genes while concurrently losing the ability of antiestrogens to
regulate their expression. Signaling pathways that are not regulated by
estrogens also can be up-regulated. Thus, some breast cancer cells may
survive antiestrogen treatment by bypassing specific growth inhibitory
signals induced by antagonist-occupied estrogen receptors.
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INTRODUCTION 

ERs⁷ are nuclear transcription factors, their activities being affected
by the nature of the ligand bound and the pattern of genes/proteins
expressed within cells (cellular context; Ref. 1). Antiestrogens
compete with endogenous estrogens for activation of ER, and induce both
cell cycle arrest and apoptosis in responsive cells (2). Neither the
genes regulated by antiestrogens that signal to apoptosis nor those
genes that confer an acquired antiestrogen resistance have been
identified. Nonetheless, antiestrogenic drugs are effective in both
premenopausal and postmenopausal breast cancer patients, and in the
metastatic and adjuvant settings (3). The most widely used
antiestrogen in current clinical practice is the triphenylethylene TAM. Clinical
experience with this drug likely now exceeds 10 million patient years.
When patients with metastatic disease are selected for treatment based
on the ER and PgR content of their tumors, responses are seen in up
to 75% of tumors expressing both receptors (2). TAM also reduces the
incidence of ER-positive breast cancers in high risk women (4).

Other antiestrogens have emerged recently, most notably the
benzothiophene Raloxifene and the steroidal ICI 182,780 (Faslodex).
Both drugs appear to have significant clinical activity and may have
better toxicological profiles when compared with TAM (2). Faslodex
has significant activity in TAM-resistant patients (5), consistent with
data obtained previously with TAM-resistant human breast cancer
cells selected in vitro (6).

Despite the utility of antiestrogens, most tumors that initially
respond to these drugs will recur and require alternative systemic
therapies (2). Unfortunately, the precise mechanisms that confer
resistance remain unknown. Change to an antiestrogen-stimulated
phenotype has been described in some animal models (6, 7). This
phenotype may occur in up to 20% of breast cancer patients, but a loss
of responsiveness to antiestrogens may be the more common
phenotype (2). The expression of mutant ER proteins and splice variants has
been reported, but the functional role of these in endocrine resistance
remains unclear (2). Most tumors acquiring antiestrogen resistance do
so while retaining expression of ER (8). Thus, whereas lack of ER
expression is a major form of de novo antiestrogen resistance, other
mechanisms must be active in most instances of acquired resistance
(2). The persistent expression of ER in tumors with acquired
resistance suggests that some cells expressing this phenotype may either
require ER expression and/or reflect the altered expression of
otherwise estrogen-regulated genes.

Because ER-mediated transcription is directly affected by
antiestrogens, we initially hypothesized that antiestrogen resistance might
include perturbations in the patterns of expression and/or regulation of
a subset of all of the ER-regulated genes (1). To address this
hypothesis, we first generated a novel series of human breast cancer variants
from the MCF-7 human breast cancer cell line. These cells have
different growth requirements for estrogen and exhibit differential
sensitivities to TAM and ICI 182,780 (9-11). In this study, we focus
on MCF7/LCC1 cells (estrogen-independent, TAM-responsive, and
ICI 182,780 responsive) and MCF7/LCC9 cells (estrogen-independent,
ICI 182,780 resistant, and TAM cross-resistant; Ref. 11).
Because the cells exhibit comparable cell cycle profiles⁸ and are both
MCF-7 variants, we can exclude the altered expression of genes
related solely to differences in both genetic background and cell cycle
distribution. A direct comparison of these respective transcriptomes
should identify genes associated with survival from long-term
antiestrogen exposure.

The abbreviations used are: ER, estrogen receptor; CRE, cyclic AMP response
element; CCS-IMEM, improved minimal essential medium supplemented with 5%
charcoal calf stripped serum; EGF-R, epidermal growth factor receptor; IRF-1,
interferon regulatory factor-1; NPM, nucleophosmin; PgR, progesterone receptor;
SAGE, serial analysis of gene expression; TAM, Tamoxifen; XBP-1, X-box binding
protein-1; FACS, fluorescence-activated cell sorting; NFκB, nuclear factor κB;
EGR-1, early growth response factor-1; TNFα, tumor necrosis factor α.

Several techniques are now available to explore the transcriptomes
of tumors and experimental models. However, the most effective
approach remains a matter of debate (12). Studies in breast cancer
have been limited, most simply attempting to identify the genes
expressed in breast cancers. For example, a recent study by Perou
et al. (13) explored data from excisional breast biopsies from 42
individuals. Gene clusters, identified by exploration of the data
structure, include those associated with ER, HER-2, and IFN-induced
genes. A similar cluster of IFN-regulated genes was identified in the
breast cancer cell lines included in the NIH drug screening program
(14). Studies comparing the gene expression profiles of specific breast
cancer phenotypes include an examination of histologically different
samples from a single breast cancer lesion (15) and a preliminary
analysis of a TAM-stimulated xenograft model (16). None of these
reports directly addressed either the function or potential role of the
specific genes identified. We have used two different but
complementary approaches, SAGE and gene expression microarrays. These
approaches would not be expected to provide identical data because not
all of the genes identified by SAGE are on the microarrays, some
genes identified on the cDNA arrays may be confounded by
cross-hybridization to homologous RNAs, and the ability to detect
significant differences between the SAGE databases is affected by the
relative abundance of the tags and the size of the databases. We
approached both technologies as means to sample the transcriptomes
of MCF7/LCC1 and MCF7/LCC9 cells, and to generate data that
would allow us to begin testing our hypothesis implicating
estrogen-regulated genes in antiestrogen resistance. We now show that cells can
survive prolonged antiestrogen treatment by altering the expression,
patterns of regulation, and functional activation of specific
estrogen-regulated genes.

MATERIALS  AND  METHODS 

Cell Lines. MCF7/LCC1 cells were derived from the estrogen-dependent
MCF-7 human breast cancer cell line after selection for growth in
ovariectomized nude mice (9, 17). MCF7/LCC9 cells were obtained by an in vitro
stepwise selection of the estrogen-independent but antiestrogen-responsive
MCF7/LCC1 cells against the steroidal antiestrogen ICI 182,780 (Faslodex).
MCF7/LCC9 cells are ICI 182,780 resistant and TAM cross-resistant, express
ER and PgR, and exhibit an estrogen-independent but responsive phenotype
(11). MCF7/LCC1 and MCF7/LCC9 cells were routinely passaged in
Improved Minimal Essential Medium without phenol red (Biofluids, Bethesda,
MD) supplemented with 5% CCS-IMEM. Serum was stripped of endogenous
estrogens as described previously and is estimated to contain <10 fM estrogen
(18). Vehicle for all of the hormone/antihormone treatments was ethanol (final
concentration <0.1% v/v). All of the cell cultures were maintained at 37°C in
a humidified 5% CO2:95% air atmosphere and shown to be free of
contamination with Mycoplasma species as determined by solution hybridization to
Mycoplasma-specific, radiolabeled, RNase riboprobes (Gen-Probe Inc., San
Diego, CA).

⁸ R. Clarke, unpublished observations.

SAGE Analyses. SAGE was performed as described previously (19).
Polyadenylic acid mRNA was harvested from cells using biotin-labeled
oligodeoxythymidylic acid magnetic beads (Promega PolyATract System 1000
kit; Promega, Madison, WI) and treated with DNase I enzyme to remove any
contaminating DNA. mRNA (5 μg) was converted to double-stranded cDNA
using the Life Technologies, Inc. cDNA Synthesis kit (Life Technologies, Inc.,
Rockville, MD). Biotinylated cDNA was completely cleaved with NlaIII and
the 3'-end digested fragments extracted with magnetic streptavidin beads. The
cDNA was evenly divided and ligated, one half to linker A and the other half
to linker B (19). Cleavage of the cDNA by BsmFI produced 11-13 bp oligo
DNA tags with linkers, which were blunt-ended with T4 polymerase. Linkers
A and B were ligated together to form ditags, which were then amplified by
PCR using primers to linkers A and B. Ditags (22-26 bp) were gel purified and
ligated into concatenated polytags. The polytags were purified and cloned into
the SphI-digested pZErO-1 vector, which was transferred to competent
TOP10F' cells by electroporation. Positive clones were selected overnight at
37°C for growth on low-salt Luria-Bertani bacterial plates supplemented with
Luria-Bertani-Zeocin (50 μg/ml) and isopropyl β-D-thiogalactopyranoside (1
mM). Colonies were screened for plasmids containing appropriate inserts by
size fractionating PCR products, obtained using M13 forward and reverse
primers, in agarose gels. PCR products containing concatamers of >600 bp
were purified and sequenced.

Characteristics of the SAGE databases are shown in Table 1. We compared
the MCF7/LCC1 and MCF7/LCC9 databases, using the SAGE version 1.00
software (kindly provided by Dr. K. W. Kinzler, Johns Hopkins University,
Baltimore, MD), to identify putatively differentially expressed genes. Only a
representative sample of these can be presented. The genes presented in
Table 2 were primarily selected based on: (a) fold difference ≥2-fold; (b) that
the Tags compared should represent <2 genes; and (c) that a Tag found in
either the MCF7/LCC1 and/or MCF7/LCC9 SAGE libraries must represent
≥0.10% of the database. Evidence that a gene was already known to be
expressed in breast cancers also was considered. None of these criteria were
considered an absolute requirement for gene selection. Whereas 2-fold was
selected as the cutoff, biologically critical events can be controlled by genes
that exhibit a fold regulation as small as 50% (20). As described recently by
Man et al. (21), χ² analyses were used to compare the proportions of specific
tags in each database.
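
The tag-level comparison can be sketched as follows. This is only a minimal
illustration and not the SAGE version 1.00 software: it assumes a 2 × 2
Pearson χ² test (1 degree of freedom, no continuity correction) of one Tag's
count against all remaining Tags in each library, and it assumes that fold
difference is the ratio of raw Tag counts. The function names are
hypothetical; the library sizes and Tag counts are those given in Tables 1
and 2.

```python
import math


def chi_square_2x2(count_a, total_a, count_b, total_b):
    """Pearson chi-square (1 df) comparing one tag's proportion in two libraries.

    Returns (chi2, p). For 1 df the upper-tail probability is erfc(sqrt(chi2/2)).
    """
    a, b = count_a, total_a - count_a      # library A: this tag, all other tags
    c, d = count_b, total_b - count_b      # library B: this tag, all other tags
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0, 1.0                    # degenerate table, no evidence
    chi2 = n * (a * d - b * c) ** 2 / denom
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))


def fold_difference(count_a, count_b):
    """Ratio of the larger to the smaller raw tag count (assumed definition)."""
    hi, lo = max(count_a, count_b), min(count_a, count_b)
    return hi / lo if lo else float("inf")


# Library sizes from Table 1; XBP-1 tag counts from Table 2.
LCC1_TAGS, LCC9_TAGS = 12816, 11109
chi2, p = chi_square_2x2(7, LCC1_TAGS, 25, LCC9_TAGS)
print(f"XBP-1: chi2={chi2:.2f}, P={p:.4f}, fold={fold_difference(7, 25):.1f}")
```

Under these assumptions the XBP-1 comparison falls below P = 0.001 and the
NPM comparison (10 versus 14 Tags) does not reach P = 0.05, in line with the
values reported in Table 2.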

RNA Isolation, Generation of Probes, and Hybridization of Gene
Microarrays. Each probe was generated from an independent cell culture, each
culture being grown on a different day but using identical cell culture
conditions. Six MCF7/LCC1 and five MCF7/LCC9 cell cultures were used. RNA
was isolated from proliferating, subconfluent monolayers of each cell line
using the TRIzol reagent (Life Technologies, Inc., Grand Island, NY). RNA
quality was determined by standard spectroscopic and gel electrophoresis
analyses.

Probes for the Clontech Atlas gene microarrays (Clontech, Palo Alto, CA)
were made as described by the manufacturer. Briefly, 1 μg of DNase-treated
mRNA was primed with the Clontech cDNA Synthesis Primer mix. The
product was reverse transcribed into radiolabeled cDNA with [γ-32P]dATP
(Amersham Life Science Inc., Arlington Heights, IL), and the reaction
incubated at 50°C for 25 min and terminated by adding 0.1 M EDTA (pH 8.0).
Radiolabeled cDNA was purified and eluted through a NucleoSpin Extraction
Column (centrifuged at 14,000 rpm). The cDNA probe was denatured with 1
M NaOH and 10 mM EDTA, and incubated at 68°C for 20 min. C0t-1 DNA and
1 M NaH2PO4 (pH 7.0) were added to the denatured probe, and incubated at
68°C for an additional 10 min.


Table 1  Characteristics of the SAGE libraries from MCF7/LCC1 and
MCF7/LCC9 cells

Characteristics of SAGE libraries                      Tagsᵃ     Gene hits
Tags sequenced from MCF7/LCC1 cells       12,816ᵇ      5,783     1
Tags sequenced from MCF7/LCC9 cells       11,109ᵇ      1,170     2
Number of Tags identified                 10,518       208       3
Number of known Tagsᶜ                     7,221        38        4
Number of unknown Tags                    3,297        10        5

ᵃ Number of Tags representing a corresponding number of gene hits, e.g., 5,783 Tags
are specific for single genes, whereas 208 Tags could identify up to 3 genes each.
ᵇ Number of Tags in each SAGE database.
ᶜ Includes expressed sequence tags.


Table 2  Differentially expressed genes identified in the MCF7/LCC1 and MCF7/LCC9 SAGE libraries

Putative geneᵃ                  Unigene no.   MCF7/LCC1   MCF7/LCC9   Differenceᵇ   Pᶜ       Gene function
N-ras-related gene              Hs.260523     2           20          10-fold       <0.001   G-protein
Cathepsin D                     Hs.343475     7           34          5-fold        <0.001   Protease involved in tumor invasion
XBP-1                           Hs.149923     7           25          4-fold        <0.001   Transcription factor
Prefoldin 5                     Hs.288856     6           21          4-fold        0.002    Chaperone for unfolded proteins
HSP-27                          Hs.76067      23          55          2-fold        0.001    Stress response protein
Vit B-12-binding protein        Hs.2012       17          37          2-fold        0.002    Vitamin-binding protein
NPMᵈ                            Hs.9614       10          14          1.5-fold      >0.05    Oncogenic nucleolar protein
L14                             Hs.738        13          2           6-fold        0.021    Ribosomal protein
Death-associated protein-6      Hs.336916     11          2           6-fold        0.049    Apoptosis-associated protein
EF-γ                            Hs.2186       22          6           4-fold        0.014    Translation elongation factor
Ferritin, heavy polypeptide-1   Hs.62954      54          16          3-fold        <0.001   Iron-binding protein

ᵃ The gene designations are considered putative, although the identity of most genes designated in this fashion has been shown to be correct. These genes include those Tags where:
(a) the fold difference is ≥2-fold; (b) the Tag could represent <2 genes; and (c) represents ≥0.1% of either the MCF7/LCC1 and/or MCF7/LCC9 SAGE library.
ᵇ Predicted fold difference in gene expression between MCF7/LCC1 vs. MCF7/LCC9 cells.
ᶜ Obtained by χ² analyses; P estimated to 3 significant figures.
ᵈ NPM (not statistically significant) is shown because we know it to be both estrogen regulated and associated with TAM treatment in patients.

Each Atlas Array (Clontech) was prehybridized with 5 ml of ExpressHyb
buffer (Clontech) and 0.5 mg of denatured DNA from sheared salmon testes at
68°C for 30 min with continuous agitation. The cDNA probe, prepared as
described above, was then added and allowed to hybridize overnight. The array
was washed four times with 2× SSC containing 1% (w/v) SDS for 30 min at
68°C and once with 0.1× SSC containing 0.5% (w/v) SDS for 30 min at 68°C.
One final wash was performed with 2× SSC for 5 min at room temperature.
The Atlas Array was sealed in plastic and signals detected by phosphorimage
analysis using a Molecular Dynamics Storm phosphorimager (Molecular
Dynamics, Sunnyvale, CA). Each filter was used only once.

Measuring NPM and EGF-R Protein Levels. Established methods were
used for performing and quantifying Western analyses of NPM (22, 23).
Briefly, 10 μg of protein was loaded onto an SDS-PAGE gel and
fractionated under reducing conditions [5% (v/v) β-mercaptoethanol]. To account
for within-gel differences, samples were loaded in a random sequence onto
each gel. Proteins were blotted onto nitrocellulose membrane and the blots
probed with an anti-NPM monoclonal antibody (kindly provided by Dr.
Pui-Kwong Chan, Baylor College of Medicine, Houston, TX; Ref. 24).
After transfer to the membranes, equal protein loading was confirmed by
staining the nitrocellulose with Ponceau S as is widely reported (22, 23,
25). Any material remaining in the gels was stained by Coomassie Blue.
This approach provides an adequate and appropriate estimate for
equivalence of protein loading (22, 23, 25). Immunoreactivity was visualized
using a horseradish peroxidase-linked goat antimouse IgG and the
enhanced chemiluminescence detection system (Amersham Life Science
Inc.). Chemiluminescence was densitometrically measured using a
Quantity One Scanning and Analysis System (pdi, Huntingdon, NY).

EGF-R is expressed at low levels in MCF-7 cells and cannot readily be
detected/quantified by Western blot. Consequently, we measured
immunofluorescently labeled EGF-R protein by FACS. For each cell line, EGF-R
immunofluorescence was performed by rinsing 5 × 10^6 cells once in PBS and
pelleting cells by centrifugation at 1000 rpm for 5 min at room temperature.
Cell pellets were resuspended in 100 μl of an anti-EGF-R mouse monoclonal
antibody that recognizes the extracellular domain of the receptor (EGF-R
antibody-1; NeoMarkers, Lab Vision Corp., Fremont, CA; 200 μg/ml diluted
1:50 in PBS), and incubated at room temperature for 1 h. Cell pellets were then
resuspended in 1:50 dilution of R-phycoerythrin-conjugated goat antimouse
IgG-2a (CALTAG Laboratories, Burlingame, CA) and incubated in the dark
for 30 min. After rinsing in PBS, cells were again pelleted, fixed by
resuspending in 1% paraformaldehyde, and fluorescence measured by FACS.
Control cells were treated either with secondary antibody alone or with no
antibody. FACS was performed on a FACStarPLUS flow cytometer
(Becton-Dickinson, Mountain View, CA) at 488 nm.

RNase Protection Analysis of IFN Regulatory Factor-1 mRNA
Expression. Total RNA was isolated using the TRIzol reagent (Life Technologies,
Inc.) according to the manufacturer's instructions. The IRF-1 riboprobe was
made by in vitro transcription of a 360-bp fragment of the IRF-1 cDNA. The
36B4 loading control riboprobe was similarly obtained from a 220-bp fragment
of the 36B4 cDNA (17). Riboprobes were labeled by the addition of [32P]UTP
(Amersham Life Sciences Inc.) in the transcription buffer. To achieve bands
for the two genes with similar intensities, the 36B4 riboprobe was made with
a specific activity of ~20% that of the IRF-1 riboprobe. The RNase protection
assays were performed as described previously (26). Briefly, total RNA (30
μg), the IRF-1 riboprobe, and the 36B4 riboprobe were hybridized overnight
at 50°C. After digestion with RNase A, the protected fragments were size
fractionated on 6% acrylamide Tris-borate EDTA-urea minigels (Novex, San
Diego, CA). The gels were dried and the respective signals quantified by
phosphorimager analysis (Molecular Dynamics).

Estimation of the Transcriptional Activation of CREs and NFκB. Two
commercially available promoter-reporter assays were used to measure NFκB
and CRE transcriptional activities. Experiments were performed as described
by the manufacturer (Stratagene, La Jolla, CA). Briefly, firefly luciferase
reporter constructs, under the control of the appropriate enhancer elements and
trans-activator constructs, were provided in the PathDetect in vivo signal
transduction pathway cis-reporting system (Stratagene). Cells were grown to
90% confluence in 5% CCS-IMEM medium and seeded at 8 × 10^4 cells into
each well of 24-well tissue culture dishes. After incubation for 12-24 h, cells
were transiently transfected with the appropriate plasmids using the Qiagen
SuperFect transfection reagent as described by the manufacturer (Qiagen,
Valencia, CA). The ratio of plasmid to SuperFect reagent was 250 ng:1 μl, with
a transfection time of 2.5 h.

Estrogen (5 nM) and ICI 182,780 treatments (10 nM) were administered for
48 h after transfection in CCS-IMEM. Transfected cells were harvested and
firefly luciferase activity measured using the Stratagene assay system. Activity
is expressed in relative light units from a 20-μl sample as detected by
luminometry. Each measurement is from duplicate samples, independent
experiments being repeated on different days. Normalization of transfection
efficiency was made to the Renilla luciferase reporter construct, under the
control of the cytomegalovirus promoter (Promega). The Renilla luciferase
assay was performed using the Promega Dual-luciferase reporter assay system.
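
The normalization step can be illustrated with a short sketch. It assumes
that each firefly reading is divided by the Renilla reading from the same
well and that fold induction is the ratio of mean normalized activities
relative to a reference cell line (MCF7/LCC1 in Fig. 3); the exact arithmetic
is not stated in the text. The function names and the numerical readings are
hypothetical placeholders, not measured data.

```python
from statistics import mean


def normalized_activity(firefly_rlu, renilla_rlu):
    """Firefly signal corrected for transfection efficiency (assumed ratio)."""
    if renilla_rlu <= 0:
        raise ValueError("Renilla signal must be positive")
    return firefly_rlu / renilla_rlu


def fold_induction(sample_pairs, reference_pairs):
    """Mean normalized activity of a sample relative to a reference.

    Each argument is a list of (firefly_rlu, renilla_rlu) replicate readings.
    """
    sample = mean(normalized_activity(f, r) for f, r in sample_pairs)
    reference = mean(normalized_activity(f, r) for f, r in reference_pairs)
    return sample / reference


# Placeholder duplicate readings (relative light units), for illustration only.
lcc1_wells = [(100.0, 10.0), (50.0, 5.0)]
lcc9_wells = [(200.0, 10.0), (400.0, 20.0)]
print(fold_induction(lcc9_wells, lcc1_wells))
```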

Assessment  of  Growth  Response  to  Parthenolide.  MCF7/LCC1  and 
MCF7/LCC9  cells  were  plated  in  96-well  tissue  culture  plates  and  incubated 
for  24  h  in  0.2  ml  of  5%  CCS-IMEM.  Medium  was  removed  and  replaced  with 
fresh  5%  CCS-IMEM  containing  either  vehicle  (0.1%  DMSO)  or  parthenolide 
(300  nM  and  600  nM).  Cells  were  refed  every  third  day  with  the  appropriate  cell 
culture  medium.  Cell  growth  was  determined  on  day  6,  using  a  crystal  violet 
assay  where  dye  uptake  is  directly  related  to  cell  number  (27).  Cells  were 
incubated for 30 min with crystal violet stain [0.5% (w/v) crystal violet in 25%
(v/v) methanol] at 25°C. Unincorporated stain was removed with deionized
water  and  the  cells  allowed  to  dry  at  room  temperature.  Incorporated  dye  was 
extracted  into  0.1  ml  of  0.1  M  sodium  citrate  in  50%  (v/v)  ethanol  for  10-15 
min  at  room  temperature.  Absorbance  was  read  at  570  nm  using  a  Molecular 
Devices  Fmax  kinetic  microplate  reader. 

Statistical  Analyses  and  Analysis  of  Gene  Expression  Microarray  Data. 
t  tests  were  used  to  compare  control  and  experimental  groups  as  appropriate  for 
the  RNase  protection,  Western  blot,  promoter-reporter,  and  cell  proliferation 
assays.  All  of  the  tests  were  two-tailed,  with  statistical  significance  established 
at  P  <  0.05,  unless  stated  otherwise. 

For the gene array studies, background signal was estimated locally and
subtracted from the signal obtained from its target cDNA, producing the
background-corrected data. These corrections were done using the algorithms
in Pathways 4.0 (Research Genetics Inc., Huntsville, AL).
Background-corrected data were normalized to account for differences in probe-specific
activity, hybridization, and other variables among replicates (28).
Normalization was accomplished using the mean value of all of the background-corrected
signals on each array.
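
A minimal sketch of these two steps is given here. It is not the Pathways 4.0
algorithm: it assumes that the local background is simply subtracted spot by
spot and that normalization means dividing every background-corrected signal
by the mean of all corrected signals on the same array. The function names
and the example intensities are hypothetical.

```python
def background_correct(signals, backgrounds):
    """Subtract each spot's local background estimate from its raw signal."""
    if len(signals) != len(backgrounds):
        raise ValueError("one background estimate is needed per spot")
    return [s - b for s, b in zip(signals, backgrounds)]


def normalize_to_array_mean(corrected):
    """Scale one array so that its mean background-corrected signal is 1."""
    array_mean = sum(corrected) / len(corrected)
    if array_mean == 0:
        raise ValueError("array mean is zero; cannot normalize")
    return [value / array_mean for value in corrected]


# Placeholder intensities for three spots on one array, for illustration only.
raw = [120.0, 60.0, 30.0]
local_background = [20.0, 10.0, 0.0]
normalized = normalize_to_array_mean(background_correct(raw, local_background))
print(normalized)
```

After this scaling every array has the same mean signal, so replicate arrays
that differ in probe-specific activity can be compared gene by gene.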

Different approaches have been used to analyze data from gene array
studies. Some methods are simply based on fold-regulation (29), others are
more statistically based (16, 30), and/or apply an informatics-based exploration
of data structure (31, 32). The optimal approach remains a subject of
considerable debate (30). As with most gene microarray studies, our data set is high
in dimensionality (597 dimensions) but the number of replicates is limited by
the resource-intensive nature of the technology. The relatively small number
of replicates limits the applicability of normal mixture models and other
analyses that can operate in high dimensional data space (33, 34) and often
generates noisy data sets.

Previously, we have reported a hierarchical visualization algorithm that can
reveal all of the major aspects of the multimodal data points, which
concurrently exist in a high dimensional gene expression space (35, 36). Using this
algorithm, our data can be projected from 597 dimensions to two or three
dimensions (multidimensional scaling). This is accomplished by respectively
deriving the first three principal components fitted to the antiestrogen
responsive (MCF7/LCC1) and resistant (MCF7/LCC9) gene expression data (Fig. 1).
Thus, we evaluate the data structure subsets visually and assess whether these
contain differentially expressed genes that may contribute to the respective
phenotypes.
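
The projection onto the first principal components can be sketched as below.
This is plain principal component analysis and only a stand-in for the
hierarchical visualization algorithm of Refs. 35 and 36, which is not
reproduced here. Because there are far fewer replicates than genes, the sketch
works on the small sample-by-sample Gram matrix and uses power iteration with
deflation; all function names and the toy data are hypothetical.

```python
import math
import random


def center_columns(rows):
    """Subtract each gene's (column's) mean across samples."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in rows]


def mat_vec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def top_eigenpairs(matrix, k, iterations=200, seed=0):
    """Leading eigenpairs of a symmetric PSD matrix by power iteration."""
    rng = random.Random(seed)
    m = [row[:] for row in matrix]
    n = len(m)
    pairs = []
    for _component in range(k):
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _step in range(iterations):
            w = mat_vec(m, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        eigenvalue = sum(a * b for a, b in zip(v, mat_vec(m, v)))
        pairs.append((eigenvalue, v))
        for i in range(n):                 # deflate the component just found
            for j in range(n):
                m[i][j] -= eigenvalue * v[i] * v[j]
    return pairs


def pca_project(rows, k=3):
    """Return (sample scores on k components, fraction of variance captured)."""
    x = center_columns(rows)
    gram = [[sum(a * b for a, b in zip(r, s)) for s in x] for r in x]
    total = sum(gram[i][i] for i in range(len(gram)))
    pairs = top_eigenpairs(gram, k)
    scores = [[math.sqrt(max(lam, 0.0)) * vec[i] for lam, vec in pairs]
              for i in range(len(rows))]
    captured = sum(max(lam, 0.0) for lam, _ in pairs) / total if total else 0.0
    return scores, captured


# Toy data: 4 samples x 3 genes (placeholder values, not array measurements).
toy = [[10.0, 0.0, 0.0], [-10.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, -3.0, 0.0]]
scores, captured = pca_project(toy, k=2)
print(scores, captured)
```

The returned fraction corresponds to the cumulative variance captured by the
retained components, the quantity quoted in the legend to Fig. 1.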

Because we can visualize data structure, our next priority was to identify a
simple, supervised approach for reducing the dimensionality of the data
without affecting its structure. Thus, we applied geometric and simple descriptive
statistical approaches to the normalized data before and after a logarithmic
transformation of these data. As noted previously, the distribution of the
expression data for each gene is unknown (30), and it is unclear whether these
violate the normal distribution required for parametric analyses. Indeed, it
seems likely that the assumption of normality will hold for some
genes and not for others. Whereas most investigators analyze data transformed
by a logarithmic function, those genes with values that appear normally
distributed before transformation may no longer have this distribution once
transformed.


Fig. 1. Visual representations of the structure of the multidimensional gene microarray
data. A, three-dimensional representation of 597 dimensions (A, MCF7/LCC1; O,
MCF7/LCC9) where the top three principal components capture 81.2% of the cumulative
variance in the data. B, three-dimensional representation of 7 dimensions (data from Table
3) where the top three principal components capture 98.9% of the cumulative variance in
the data. Axes represent the first three principal components derived from the gene
expression data (79, 80). Plots are rotated to provide the optimal visualization. In both
plots, a plane is shown demonstrating the linear separability of the MCF7/LCC1 (n = 5)
and MCF7/LCC9 (n = 4) gene expression profiles.

To be inclusive, we used simple statistics (t tests) to explore the data. The
inflated type-1 error from multiple comparisons should overestimate (false
positive) significant differences. We considered this preferable to a high
incidence of false-negative estimates, which would lead to the exclusion of
potentially informative genes. The inclusion of uninformative genes (false
positives) is less problematic at this stage of the exploration. We used
Student's t test, a t test for unequal variance (assumes normal distribution) and
the nonparametric (distribution-free) Wilcoxon signed rank test. Logarithm
transformed and nontransformed data were explored. This approach is similar
to using an F test as described recently by Hedenfalk et al. (37).
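
The per-gene screen can be sketched as follows, under stated assumptions: the
two t statistics are the standard pooled-variance (Student) and
unequal-variance (Welch) forms, fold-regulation is the ratio of group means,
and the logarithm is taken to base 2. P values and the Wilcoxon test are
omitted because they are not available in the Python standard library. The
function names and expression values are hypothetical.

```python
import math
from statistics import mean, variance


def student_t(x, y):
    """Pooled-variance t statistic and its degrees of freedom."""
    nx, ny = len(x), len(y)
    pooled = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    t = (mean(x) - mean(y)) / math.sqrt(pooled * (1.0 / nx + 1.0 / ny))
    return t, nx + ny - 2


def welch_t(x, y):
    """Unequal-variance t statistic with Welch-Satterthwaite degrees of freedom."""
    vx, vy = variance(x) / len(x), variance(y) / len(y)
    t = (mean(x) - mean(y)) / math.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df


def screen_gene(x, y):
    """Summaries for one gene on untransformed and log2-transformed data.

    x and y are normalized, strictly positive signals from the two cell lines.
    """
    log_x = [math.log2(v) for v in x]
    log_y = [math.log2(v) for v in y]
    return {
        "fold": mean(x) / mean(y),
        "student_raw": student_t(x, y),
        "welch_raw": welch_t(x, y),
        "student_log": student_t(log_x, log_y),
        "welch_log": welch_t(log_x, log_y),
    }


# Placeholder normalized signals for one gene (5 and 4 replicates).
lcc1 = [2.0, 2.2, 1.8, 2.1, 1.9]
lcc9 = [1.0, 1.1, 0.9, 1.0]
print(screen_gene(lcc1, lcc9))
```

Comparing the raw and log-transformed statistics side by side mirrors the
inclusive strategy described above, in which no single test decides whether a
gene is retained.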

t test results were evaluated and candidate genes selected with which to
reconstruct a lower dimensional data set that should retain most of the
information apparent in the top level visualization. However, the t test results were
only one of several criteria used to guide gene selection, and only a subset of
those genes that appear to be differentially regulated are presented. These
genes were selected by comparing the results of t tests on logarithm
transformed and untransformed data, fold-regulation (~2-fold or greater was
selected because this difference is likely to be confirmed in independent
analyses), the distribution of the background-corrected and normalized data for each
gene (some genes appeared strongly differentially regulated but did not
generate statistically significant differences because of heterogeneity in the data),
and the probable relevance to breast cancer of each gene.

Where the gene subsets (reduced dimensional data) provide a reasonable
description of the entire expression data, the replicate profiles of the resistant
and responsive cells should exist in separable data space (35, 36). Furthermore,
if the profiles are adequately defined by a small, rational gene subset, some of
its members likely represent differentially expressed and functionally relevant
genes. We acknowledge that our approach is limited, and is probably only
applicable to simple comparisons within related cell culture models.


RESULTS 

Genes Implicated by SAGE. The data in Table 1 show the
number of different genes identified. Most genes were commonly
expressed, and were not differentially expressed between the
MCF7/LCC1 and MCF7/LCC9 cells. A selection of the genes identified by
SAGE, and predicted to be differentially expressed in MCF7/LCC1
and MCF7/LCC9 SAGE databases, is shown in Table 2. Presentation
of all of the genes expressed and/or differentially expressed is beyond
the scope of a single, focused study.⁹ The criteria applied for gene
selection are described in "Materials and Methods." NPM was
included because we already know it to be both estrogen regulated (23)
and indirectly associated with TAM treatment in patients (38).
Confirmation of the differential expression of NPM (see Table 2 and
Fig. 2B) and altered CRE binding activity (the function of XBP-1; see
Table 2 and Fig. 3B) indicate that these represent reasonable criteria
for gene selection. Currently, XBP-1 and NPM are the only genes
from the SAGE database comparisons for which we have attempted to
confirm differential expression/activation.

Comparing the SAGE databases identifies several genes that are
up-regulated in MCF7/LCC9 cells compared with MCF7/LCC1 cells.
These genes include XBP-1, NPM, cathepsin D, HSP-27, and n-ras.
Increased CRE activity is indicated by the up-regulation of XBP-1,
which regulates gene transcription through these response elements
(39). XBP-1 is involved in regulating the expression of several
tissue-specific genes including tissue inhibitor of metalloproteinases,
osteopontin, and osteocalcin (40). Significantly, both Perou et al. (13)
and West et al. (41) recently identified XBP-1 as being associated
with ER gene expression clusters in human breast tumor biopsies.
NPM is induced by estrogen in MCF-7 cells and is up-regulated in
estrogen-independent cells (23). NPM also provokes an autoimmune
response in breast cancer patients, the magnitude of which is
associated with TAM therapy (38).


⁹ http://clarkelabs.georgetown.edu/gu_eLaVgu_eLaLlinks.htm/.


Fig. 2. Confirmation of the differential expression of
NPM, EGF-R, and IRF-1 in MCF7/LCC1 and MCF7/LCC9
cells. A, EGF-R protein immunofluorescence as measured by
FACS (representative figure of three experiments). Arrows
indicate EGF-R signal, other signals are controls (no
antibody; primary antibody but no secondary antibody). Axes are
abscissa = fluorescence; ordinate = cell counts. B, NPM
protein as measured by Western blotting (*P ≤ 0.02) and
represented as a percentage of control (MCF-7 cells growing
in CCS-IMEM); bars, ±SE. Insert = representative Western
blot. C, IRF-1 mRNA as measured by RNase protection
(**P = 0.005, three independent replicate experiments)
and expressed in phosphorimager units; bars, ±SE.
Insert = representative analysis; 36B4 is a ribosomal gene
(loading control).

The  altered  expression  of  cathepsin  D  is  consistent  with  our  data 
published  previously,  showing  increased  secretion  of  this  protein  in 
several  of  our  hormone-independent  MCF-7  variants  (42).  Cathepsin 
D  expression  in  breast  tumors  also  is  associated,  at  least  in  some 
studies,  with  a  poor  prognosis  (43).  HSP-27  expression  has  been 
implicated  in  refining  the  diagnosis  of  suspicious  fine-needle  aspirates 
of  breast  tissues  (44).  Vitamin  B12  binding  proteins  are  expressed  in 
breast  tumors  (45),  and  vitamin  B12  deficiency  is  a  likely  risk  factor 
for  breast  cancer  (46).  Altered  expression  of  the  n-ras-related  gene  is 
consistent  with  the  elevated  ras  signaling  reported  in  some  breast 
cancer  cell  lines  and  tumors  (47). 

SAGE also identified genes expressed at higher levels in the
parental, antiestrogen-responsive cells (MCF7/LCC1) when
compared with MCF7/LCC9 cells. These include ferritin, death-associated
protein-6, and the eukaryotic elongation factor-γ. Ferritin is expressed
in breast cancers, and breast tumor-derived ferritin may be a more
useful tumor marker than measuring levels of ferritin in serum (48).

Structure of the Gene Microarray Data. It has been suggested
that the cost required to perform gene microarray studies can be
reduced by combining RNA populations from several replicates and
performing a single hybridization on an Atlas array (16). However, we
found heterogeneity among replicate experiments, which often
remained after normalization. Logarithmic transformation of these data
reduced this heterogeneity but not to the point where a single replicate
could be used to obtain an adequate description of the data.
Consequently, multiple replicates are required to provide a more reliable
estimate of the putative gene expression profiles. These observations
on filter microarrays are consistent with recent reports for glass
slide-based and oligonucleotide array-based gene expression
microarrays (49, 50).

Fig. 3. Basal transcriptional activity of NFkB and CRE in MCF7/LCC1 and
MCF7/LCC9 cells. A, NFkB. B, CRE. Data represent mean and are expressed
as fold induction relative to MCF7/LCC1; bars, ±SE. All cells were grown
in the absence of estrogens (CCS-IMEM).

Table 3 Representative list of differentially expressed genes identified by gene
microarray analyses

Gene(a)   Unigene no.   MCF7/LCC1(b)   MCF7/LCC9   Gene function
NFkB      Hs.75569      1              2           Transcription factor involved in cell survival signaling
SOD       Hs.75428      1              2           Enzyme involved in detoxifying oxygen radicals
EGR-1     Hs.326035     3              1           Transcription factor
EGFR      Hs.77432      2              1           Growth factor receptor
IRF-1     Hs.80645      2              1           Transcription factor involved in signaling to cell cycle arrest and apoptosis
TNFα      Hs.241570     2              1           Cytokine
TNF-R1    Hs.159        2              1           Cytokine receptor involved in signaling to apoptosis

(a) Abbreviations are SOD, superoxide dismutase; TNF-R1, tumor necrosis
factor-receptor 1.

(b) Data are represented as level of expression relative to the other cell
line. Data are based on the mean values for each gene (6 microarrays of
MCF7/LCC1; 5 microarrays of MCF7/LCC9). Values are expressed to the
nearest integer.
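
The relative values described in the footnote to Table 3 (mean signal per cell line, expressed relative to the other line and rounded to the nearest integer) can be sketched in Python as follows. This is an illustration only: the function name, the convention of scaling the lower mean to 1, and the signal values are assumptions, not the study's code or data.

```python
from statistics import mean


def relative_expression(values_a, values_b):
    """Express two mean signals relative to the lower one, as integers.

    Assumption: the cell line with the lower mean is assigned 1 and the
    other is reported as the rounded ratio, which is one reading of the
    Table 3 footnote.
    """
    mean_a, mean_b = mean(values_a), mean(values_b)
    low = min(mean_a, mean_b)
    if low <= 0:
        raise ValueError("means must be positive to form a ratio")
    # int(x + 0.5) rounds halves up; Python's round() rounds halves to even.
    return int(mean_a / low + 0.5), int(mean_b / low + 0.5)


# Hypothetical normalized signals for one gene (not measured data):
# six arrays for one cell line, five for the other.
lcc1_signal = [410, 395, 430, 405, 420, 400]
lcc9_signal = [205, 198, 215, 210, 202]
print(relative_expression(lcc1_signal, lcc9_signal))
```

Rounding to integers hides the spread among replicate arrays, so a table of this kind is best read alongside the replicate-level data.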

Fig. 1A is a visual representation of the multidimensional data (597
dimensions)  in  three  dimensions.  This  visualization  allows  for  an 
inspection  of  the  data  structure,  and  the  likely  comparability  of  the 
replicates  among  each  other  and  between  the  two  experimental  groups 
(antiestrogen-responsive  MCF7/LCC1  and  antiestrogen-resistant 
MCF7/LCC9).  For  this  top  level  visualization,  the  replicate  gene 
expression  profiles  for  MCF7/LCC1  and  MCF7/LCC9  exist  within 
linearly  separable  regions  of  the  gene  expression  data  space  after 
elimination  of  one  outlier  array  from  each  experimental  group.  The 
top  three  principal  components  capture  81.2%  of  the  cumulative 
variance  in  the  data  (597  dimensions).  Thus,  the  data  structure  is 
consistent  with  differences  in  the  gene  expression  profiles  as  predicted 
by  the  known  differential  antiestrogen  responsiveness  of  the  two 
variants. 
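
As an illustration of the kind of projection described above, the following Python sketch reduces a matrix of expression profiles to its top three principal components and reports the cumulative fraction of variance they capture. It is a generic principal component analysis using NumPy, not the visualization software used in the study; the function name and the synthetic data are assumptions made for the example.

```python
import numpy as np


def pca_project(profiles, n_components=3):
    """Project arrays (rows) onto the top principal components.

    Returns the component scores and the cumulative fraction of
    variance captured by the first `n_components` components.
    """
    X = np.asarray(profiles, dtype=float)
    centered = X - X.mean(axis=0)  # center each gene across arrays
    # Squared singular values of the centered matrix are proportional
    # to the variance along each principal component.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    variance = s ** 2
    cumulative = np.cumsum(variance) / variance.sum()
    scores = centered @ vt[:n_components].T
    return scores, cumulative[:n_components]


# Synthetic stand-in data, NOT the study's measurements:
# 5 + 4 arrays, 597 genes, with a block of genes shifted in one group.
rng = np.random.default_rng(0)
group_a = rng.normal(0.0, 1.0, size=(5, 597))
group_b = rng.normal(0.0, 1.0, size=(4, 597))
group_b[:, :50] += 3.0
scores, cum_var = pca_project(np.vstack([group_a, group_b]))
print(scores.shape, cum_var)
```

The last element of `cum_var` plays the role of the cumulative variance figure quoted in the text; its value for real data depends on the normalization and transformation applied beforehand.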

Genes Implicated by Gene Microarray Studies. The data in
Table 3 show the fold-differences in expression of selected genes
identified in the Clontech Atlas gene microarray studies selected using
the criteria described in “Materials and Methods.” The selection was
not intended to describe fully the data set, only to assist in an initial
exploration of the data. This small but rational subset of genes could
be additionally evaluated in focused studies to confirm the differential
expression patterns and establish potential functional relevance.
Furthermore, if members of this subset were truly differentially
expressed, we could begin to understand how cells perceive
antiestrogens and adapt to this selective pressure.

To determine whether these genes are broadly representative of the
differences between the gene expression profiles of MCF7/LCC1 and
MCF7/LCC9 cells, we generated a three-dimensional projection from
the seven-dimensional gene expression data space (Fig. 1B). This was
necessary because we used several criteria to construct the subset,
including some genes where fold-regulation or distribution of the data
were given more weight than formal statistical significance.
Consequently, we could not assume that we had maintained the linear
separability of the data, at the top level, as seen in all 597 dimensions.

We might not expect this small subset of expression data (<2% of
the information) to prove as effective in representing the respective
phenotypes as the full data set (597 genes). Nonetheless, as for the
597-dimension visualization, after elimination of outlier data the
seven-dimensional MCF7/LCC1 and MCF7/LCC9 profiles remain in
linearly separable, three-dimensional data space. The top three
principal components capture 98.9% of the cumulative variance in the
data (seven dimensions). This observation suggests that these data
contain information that contributes to the differences in the
molecular profiles of these two variants, that these genes may contribute to
the respective biological phenotypes, and that additional studies of
their potential functional relevance are warranted.
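
One simple way to check whether two groups of projected profiles are linearly separable is to search for a separating plane with a perceptron, sketched below in Python. This is not the method used in the study; the function name and the three-dimensional coordinates are hypothetical and serve only to illustrate the concept.

```python
def find_separating_plane(points_a, points_b, max_epochs=1000):
    """Perceptron search for (w, b) with w.x + b > 0 for group A
    and w.x + b < 0 for group B.

    Returns (w, b) if a separating plane is found within `max_epochs`
    passes over the data, otherwise None.
    """
    labelled = [(p, 1) for p in points_a] + [(p, -1) for p in points_b]
    dim = len(labelled[0][0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for x, y in labelled:
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:  # misclassified or on the plane
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                errors += 1
        if errors == 0:
            return w, b
    return None


# Hypothetical principal component scores (not the study's data).
lcc1_like = [(2.1, 0.3, -0.5), (1.8, -0.2, 0.4), (2.5, 0.1, 0.0)]
lcc9_like = [(-1.9, 0.4, 0.2), (-2.2, -0.3, -0.1), (-1.6, 0.0, 0.5)]
plane = find_separating_plane(lcc1_like, lcc9_like)
print("separable" if plane is not None else "no plane found")
```

A returned plane demonstrates separability. A result of None only means no plane was found within the allowed number of passes, so it is suggestive rather than a proof that the groups overlap.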

Genes expressed at a higher level in the MCF7/LCC1 cells include
EGF-R, EGR-1, IRF-1, and both TNFα and its R1 receptor (TNF-R1).
A well-established inverse relationship exists between the expression
of EGF-R and ER in breast tumors (51). EGF-R can induce expression
of EGR-1 (52), and expression of both genes is lower in MCF7/LCC9
cells. EGR-1 is a transcription factor with proapoptotic activity (53)
that can block NFkB function (54) and repress TGF-β receptor
expression (29). EGR-1 expression is down-regulated in
7,12-dimethylbenz(a)anthracene-induced mammary adenocarcinomas in rats (55).
IRF-1 is an IFN-regulated transcription factor that functions as a
tumor suppressor gene (56, 57) and is induced by TNFα (58). A
TNFα-mediated pathway for signaling to apoptosis occurs in MCF-7
human breast cancer cells (59, 60), and measuring serum TNF
concentrations may be a useful prognostic marker in breast cancer
patients (61). Furthermore, HER-2/neu can block resistance to
TNFα-induced apoptosis in breast cancer cells, using a mechanism that
involves activation of NFkB (62). We have previously implicated
overexpression of superoxide dismutase in resistance to TNFα in
MCF-7 cells (63). Superoxide dismutase appears to be up-regulated in
MCF7/LCC9 cells (Table 3) and in TAM-stimulated MCF-7
xenografts (64). NFkB (p65/RelA) appears expressed at higher levels in
MCF7/LCC9 cells. NFkB is overexpressed in ER-negative breast
cancer cells (65) and has an important role in the development of the
normal mammary gland (66).

NPM, EGF-R, and IRF-1 Are Differentially Expressed in
MCF7/LCC1 and MCF7/LCC9 Cells. The data in Table 2 and
Table 3 predict differential expression of NPM, EGF-R, and IRF-1
between MCF7/LCC1 and MCF7/LCC9 cells. To confirm these
observations, we measured the levels of the EGF-R
(immunofluorescence) and NPM proteins (Western blot) and IRF-1 mRNA (RNase
protection). The data in Fig. 2A show that MCF7/LCC9 cells express
lower amounts of EGF-R than MCF7/LCC1 cells. NPM protein
expression is significantly increased in MCF7/LCC9 cells compared
with MCF7/LCC1 cells (Fig. 2B; P < 0.02), consistent with the
predicted data from the SAGE analyses (Table 2) and our previous
studies (23, 38). The higher levels of IRF-1 mRNA, seen in the
antiestrogen-responsive MCF7/LCC1 cells in Table 3, are confirmed
by RNase protection analysis (Fig. 2C; P = 0.005). Both the gene
microarray and RNase protection analyses show an ~2-fold higher
level of IRF-1 expression in MCF7/LCC1 cells, when compared with
the antiestrogen-resistant MCF7/LCC9 cells.

Transcriptional Regulatory Activities of NFkB and CRE Are
Increased in MCF7/LCC9 Cells. The increased expression of NFkB
(gene expression microarray) and XBP-1 (SAGE) implies increased
transcriptional activation of promoters containing NFkB and CRE
response elements, respectively. We confirmed these observations
directly, using commercially available promoter-reporter assays to
measure transcriptional activities. The data in Fig. 3 show that the
basal activity of both promoters is increased in MCF7/LCC9 cells:
~10-fold for NFkB and 4-fold for CRE (P < 0.02). The increase in
transcriptional activation of the NFkB constructs is greater than that
predicted by the gene array data, but mRNA, protein, and protein/DNA
binding activities can be poor predictors of the functional
activation of some transcription factors (67). This prediction is not
problematic for XBP-1, where the 4-fold increase in mRNA
expression identified by SAGE (Table 2) compares well with the 4-fold
increase in basal transcriptional activation (Fig. 3B).
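
Fold induction relative to a reference cell line, with its SE, can be computed as in the following Python sketch. The reporter readings are hypothetical, and scaling every reading by the reference mean is one common convention assumed here; it is not necessarily the calculation used for Fig. 3.

```python
from math import sqrt
from statistics import mean, stdev


def fold_induction(reference, test):
    """Mean and SE of `test` readings, each expressed relative to the
    mean of the `reference` readings."""
    ref_mean = mean(reference)
    folds = [value / ref_mean for value in test]
    se = stdev(folds) / sqrt(len(folds))
    return mean(folds), se


# Hypothetical reporter readings in arbitrary units (not measured data).
reference_readings = [100.0, 110.0, 90.0]
test_readings = [380.0, 420.0, 400.0]
fold, se = fold_induction(reference_readings, test_readings)
print(f"{fold:.1f}-fold, SE {se:.2f}")
```

This simple form ignores the uncertainty in the reference mean itself; propagating that error would widen the reported SE.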

We next assessed whether ICI 182,780, the antiestrogen used to
generate the MCF7/LCC9 cells, could regulate the transcriptional
activities of NFkB and CRE. Whereas ICI 182,780 inhibits NFkB
activity in the MCF7/LCC1 cells (TAM- and ICI 182,780-responsive),
this regulation is lost in the TAM and ICI 182,780
cross-resistant MCF7/LCC9 cells (Fig. 4A). In contrast, ICI 182,780
treatment does not alter the transcriptional regulatory activities of the CRE
promoter in either of these variants (Fig. 4B).

Fig. 4. Regulation of NFkB and CRE transcription by ICI 182,780 in MCF7/LCC1 and
MCF7/LCC9 cells. A, NFkB (*P < 0.001, MCF7/LCC1 versus MCF7/LCC9). B, CRE
(not significant). NFkB and CRE data are represented as mean of transcriptional activation
expressed as a percentage of controls (vehicle-treated cells of the same cell line);
bars, ±SE. Cells were grown in CCS-IMEM and treated with 10 nM ICI 182,780 for 48 h
before measuring reporter gene expression.

MCF7/LCC9 Cells Are Specifically Responsive to an Inhibitor
of NFkB Activity. The increased activation of NFkB and loss of its
estrogenic regulation in MCF7/LCC9 cells suggest that these cells
might now be partly dependent on NFkB signaling for
survival/growth. Consequently, we compared the growth response of
MCF7/LCC1 and MCF7/LCC9 cells to parthenolide, a potent and specific
inhibitor of NFkB that can inhibit the inhibitor of NFkB kinase
repressor of NFkB (68, 69) and also binds NFkB in a highly
stereospecific manner to block DNA binding (70). Parthenolide produces
a dose-dependent inhibition of MCF7/LCC9 cell growth, with an apparent
IC50 of ~600 nM (Fig. 5). In contrast, parthenolide does not
significantly affect growth of MCF7/LCC1 cells at these concentrations.
MCF7/LCC9 cells are significantly more dependent on the
transcriptional regulatory activities of NFkB than their ICI 182,780-responsive
parental cells (P < 0.01 for MCF7/LCC9 versus MCF7/LCC1 at both
300 nM and 600 nM parthenolide).
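
An apparent IC50 can be read from dose-response data by interpolating between the two doses that bracket 50% of control. The Python sketch below interpolates on log10(dose), which is one common convention and an assumption here; the dose-response values are hypothetical and are not the data shown in Fig. 5.

```python
from math import log10


def ic50_log_interpolation(doses, percent_of_control):
    """Estimate the dose giving 50% of control growth.

    Assumes doses are positive and sorted in ascending order.
    Returns None if the response never crosses 50%.
    """
    pairs = list(zip(doses, percent_of_control))
    for (d_lo, r_lo), (d_hi, r_hi) in zip(pairs, pairs[1:]):
        if r_lo >= 50.0 >= r_hi and r_lo != r_hi:
            # Fraction of the way from r_lo down to r_hi at which 50% falls.
            frac = (r_lo - 50.0) / (r_lo - r_hi)
            return 10 ** (log10(d_lo) + frac * (log10(d_hi) - log10(d_lo)))
    return None


# Hypothetical growth data, percent of vehicle control (not measured data).
doses_nm = [150, 300, 600, 1200]
growth_percent = [88, 71, 52, 30]
ic50 = ic50_log_interpolation(doses_nm, growth_percent)
print(f"apparent IC50 ~ {ic50:.0f} nM")
```

With only a few doses the estimate is coarse; fitting a sigmoidal dose-response model would be the usual next step when more concentrations are available.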

DISCUSSION 

We have begun to identify the molecular changes associated with
cell survival after prolonged ICI 182,780 treatment in breast cancer
cells. Whereas we have not attempted to confirm the altered
expression of all implicated genes, some expression patterns are consistent
with the activities we have confirmed. Here we discuss only those
genes for which altered mRNA, protein, and/or transcriptional
activation have been confirmed, and that are known to interact with each
other in various cellular models, i.e., IRF-1, NPM, NFkB, and CRE.


IRF-1 can function as a tumor suppressor and can signal to
apoptosis through both p53-dependent and p53-independent pathways
(71). These observations may partly reflect the ability of IRF-1 to
induce a caspase cascade through activation of caspase 1 (ICE;
Ref. 72) and/or caspase 7 (73). Caspase 1 is involved in regulating
apoptosis in normal mammary epithelial cells (74), and
overexpression of caspase 1 is lethal in MCF-7 human breast cancer cells (75).
Preliminary data from our laboratory demonstrate that overexpression
of IRF-1 inhibits anchorage-dependent colony formation and that the
rate of cell proliferation in MCF-7 cells is inversely related to the level
of IRF-1 expression (76). These data suggest that the down-regulation
of IRF-1 in MCF7/LCC9 cells may protect these cells from
IRF-1-induced inhibition of proliferation and/or induction of apoptosis.

NPM can function as an oncogene, its overexpression fully
transforming NIH 3T3 cells in a standard assay for oncogenic potential
(77). We have shown that levels of autoantibodies to NPM increase in
breast cancer patients 6 months before their recurrence. Consistent
with an estrogenic/antiestrogenic regulation of NPM, the levels of
these autoantibodies are lower in breast cancer patients that have
received TAM (38). The increased NPM expression in MCF7/LCC9
cells compared with MCF7/LCC1 cells may reflect oncogenic
potential of NPM, an activity potentially related to its ability to inhibit
IRF-1 function (see below).

NFkB  has  been  implicated  in  resistance  to  cytotoxic  drugs  and  can 
function  as  a  survival  factor  in  various  cell  types  (78).  Several  aspects 
of  normal  mammary  gland  development  appear  dependent  on  NFkB 
activity  (66),  perhaps  partly  reflecting  its  estrogenic  regulation  (65). 
Elevated  NFkB  activity  arises  early  during  neoplastic  transformation 
in  the  rat  mammary  gland  (79).  Widely  expressed  in  breast  cancer 
cells  and  tumors,  elevated  NFkB  activity  is  associated  with  estrogen- 
independence  (65,  66).  Currently,  NFkB  is  the  only  protein  known 
to  induce  BRCA2  expression  (80).  ICI  182,780  cannot  suppress  the 
increased  NFkB  activity  in  MCF7/LCC9  cells,  despite  inhibiting 
this  function  in  ICI  182,780-responsive  cells  (MCF7/LCC1).  The 
functional  relevance  of  this  observation  was  tested  directly  using 
parthenolide,  which  both  specifically  binds  NFkB  and  blocks 
degradation  of  the  endogenous  NFkB  inhibitor  IkB,  resulting  in 
the  inhibition  of  NFkB  transcriptional  regulatory  activities  (68, 
70).  This  activity  of  parthenolide  has  been  used  to  evaluate  the 
functional  role  of  NFkB  in  several  recent  studies  (68,  69,  81,  82). 
MCF7/LCC9 cells are significantly more sensitive to growth
inhibition by parthenolide than their MCF7/LCC1 parental cells. This
observation is consistent with a greater functional reliance on NFkB
activation for cell growth/survival, and implies that one option for
surviving antiestrogen exposure is the up-regulation of an
estrogen-regulated survival factor(s) concurrent with the loss of its
ER-mediated regulation. Furthermore, parthenolide is now in clinical
trials, and our data suggest that it may prove useful in combination
with Faslodex or other antiestrogens to increase responsiveness
and/or delay the appearance of resistant disease.

Fig. 5. Response to inhibition of NFkB activity by parthenolide. Data represent mean
of four determinations, where absorbance in each treated population is expressed as a
percentage of the absorbance in control cells (vehicle-treated cells of the same cell line).
*P ≤ 0.01 MCF7/LCC1 versus MCF7/LCC9. Cells were grown in CCS-IMEM without
(control; vehicle only) or with parthenolide supplementation (300 nM; 600 nM).

XBP-1 has been identified recently in clusters of genes
associated with ER-positive breast tumors in two independent studies
(13, 41), and its expression is increased in MCF7/LCC9 cells.
XBP-1 is a transcription factor that binds and activates CRE (39).
The importance of CRE-regulated events is widely reported in
many cell types (83, 84). These events include a likely role in
signal transduction either at or downstream of ER and PgR (85).
The relevance of increased CRE activity in MCF7/LCC9 cells is
additionally supported by recent evidence that CRE-decoy
oligonucleotides inhibit the growth of MCF-7 cells (86). We detected a
4-fold increase in CRE transcriptional activation in MCF7/LCC9
cells. Importantly, ICI 182,780 cannot regulate CRE activity
in either MCF7/LCC1 (ICI 182,780-responsive) or MCF7/LCC9
(resistant) cells. These data imply an additional option available
to breast cancer cells, a switch to signaling pathways that are
normally independent of ER-mediated signaling.

IRF-1, NPM, NFkB, and CRE are known to affect cell
proliferation, apoptosis, and/or carcinogenesis. Two critical protein-protein
interactions directly link the IRF-1, NFkB, and NPM proteins. Direct
binding occurs between IRF-1 and NPM (77), and between IRF-1 and
NFkB (87, 88). In both cases, the interactions with IRF-1 have
important effects on gene transcription and cell signaling. NPM
binding inhibits the transcription regulatory activities of IRF-1 (77). A
coordinated perturbation in the regulation of these two genes has
occurred in the MCF7/LCC9 cells; NPM is up-regulated and IRF-1 is
down-regulated. Thus, overexpression of NPM could additionally
reduce the remaining lower levels of IRF-1, potentially
blocking/eliminating its ability to initiate an apoptotic caspase cascade through
caspase 1 and/or caspase 7. Such an effect would likely also eliminate
the ability of IRF-1 to induce p21cip1/waf1 (89) and cooperate with
wild-type p53 in signaling to apoptosis (56, 57). Changes in the
amount of available IRF-1 will directly affect the number of
IRF-1:NFkB heterodimers available to regulate an additional series of genes.
Whereas NFkB will compete with NPM for IRF-1 binding, their
relative affinities for IRF-1 are unknown, and the preferred IRF-1
heterodimer remains to be established. IRF-1:NFkB protein-protein
interactions or other cooperative interactions are implicated in the
induction of ATF-2/jun (90), RANTES (91), VCAM-1 (88),
interleukin 6 (92), and MHC class I genes (87). A functional IFN-β
enhanceosome has been described that includes IRF-1, NFkB, and
ATF-2/jun (93). The importance of both IRF-1 and NFkB in
IFN-induced signaling may contribute to the ability of IFNs to increase
responses to antiestrogens (94-96).

CRE activation also may interact with the pathways regulated by
IRF-1, NFkB, and NPM interactions. Delgado et al. (97) described a
cyclic AMP-dependent pathway that inhibits IRF-1 transactivation.
Thus, the increased CRE activity in MCF7/LCC9 cells may explain,
in part, the lower IRF-1 mRNA levels seen both in the gene
expression arrays and in the IRF-1 RNase protection studies.

The concurrent changes in NPM, IRF-1, NFkB, and CRE suggest
a novel integrated signaling pathway that may involve the ability of
NPM and CRE to inhibit IRF-1 initiation of a caspase cascade to
apoptosis, the altered ability of cells to induce genes dependent on
IRF-1:NFkB, and an increased activation of survival pathways that
involve both NFkB and CRE. Studies to additionally establish the
nature, function, and regulation of this putative pathway are currently
in progress, including an overexpression of NFkB in sensitive cells
and a dominant-negative approach in resistant cells. Because we
looked only at cells that survived long-term antiestrogen exposure, the
ability of the changes implicated in the present study to protect from
an initial or short-term exposure has yet to be determined. For
example, cells may or may not survive an initial antiestrogenic
exposure by the same mechanisms that allow for long-term survival.
Irrespective of whether these other genes are functionally involved,
their patterns of expression may be important in better predicting the
25% of ER+/PgR+, 55% of ER-/PgR+, and 66% of ER+/PgR-
breast tumors that do not respond to antiestrogens (2).

It is not possible, in a single focused study, to define all of the
potentially differentially expressed genes nor to establish their
functional relevance firmly. Because the number of cellular models
studied is small, additional functional studies where expression of the
candidate genes is induced or repressed are in progress. Nonetheless,
our data imply that breast cancer cells have highly plastic
transcriptomes, with access to several signal transduction pathways for
regulating the choice to differentiate, proliferate, or die. For example,
MCF7/LCC9 cells have taken several possible
interactive/interdependent approaches to circumvent the growth inhibitory effects of
antiestrogens. This plasticity in gene expression patterns is consistent with
the marked heterogeneity apparent in the clinical disease (2, 98).

In summary, our data suggest that one molecular profile associated
with surviving prolonged antiestrogen exposure may include loss of
ER-mediated signaling to apoptosis through IRF-1. This lost signaling
is achieved both by down-regulation of IRF-1 and a coordinated
up-regulation of its inhibitor NPM, and possibly another protein
partner NFkB. Up-regulation of CRE activities also is implicated
in this molecular profile. Other patterns of gene expression may
provide alternative routes to the resistant phenotype or occur in cells
that acquire a TAM-stimulated phenotype (2). The identification of these
molecular profiles and signaling pathways may ultimately allow us to
understand ER-regulated signaling, facilitate the development of novel
treatment strategies, and allow clinicians to better identify antiestrogen-
responsive and -resistant breast tumors.

ABSTRACT

Purpose: Gene expression microarray technologies have
the potential to define molecular profiles that may identify
specific phenotypes (diagnosis), establish a patient’s
expected clinical outcome (prognosis), and indicate the
likelihood of a beneficial effect of a specific therapy (prediction).
We wished to develop optimal tissue acquisition, processing,
and analysis procedures for exploring the gene expression
profiles of breast core needle biopsies representing cancer
and noncancer tissues.

Experimental Design: Human breast cancer xenografts
were used to evaluate several processing methods for
prospectively collecting adequate amounts of high-quality RNA
for gene expression microarray studies. Samples were
assessed for the preservation of tissue architecture and the
quality and quantity of RNA recovered. An optimized
protocol was applied to a small study of core needle breast
biopsies from patients, in which we compared the molecular
profiles from cancer with those from noncancer biopsies.
Gene expression data were obtained using Research
Genetics, Inc. NamedGenes cDNA microarrays. Data were
visualized using simple hierarchical clustering and a novel
principal component analysis-based multidimensional scaling.
Data dimensionality was reduced by simple statistical
approaches. Predictive neural networks were built using a
multilayer perceptron and evaluated in an independent data
set from snap-frozen mastectomy specimens.

Results: Processing tissue through RNAlater preserves
tissue architecture when biopsies are washed for 5 min on
ice with ice-cold PBS before histopathological analysis. Cell
margins are clear, tissue folding and fragmentation are not
observed, and integrity of the cores is maintained, allowing
optimal pathological interpretation and preservation of
important diagnostic information. Adequate concentrations of
high-quality RNA are recovered; 51 of 55 biopsies produced
a median of 1.34 µg of total RNA (range, 100 ng to 12.60
µg). Snap-freezing or the use of RNAlater does not affect
RNA recovery or the molecular profiles obtained from
biopsies. The neural network predictors accurately
discriminate between predominantly cancer and noncancer breast
biopsies.

Conclusions: The approaches generated in these studies
provide a simple, safe, and effective method for
prospectively acquiring and processing breast core needle biopsies
for gene expression studies. Gene expression data from these
studies can be used to build accurate predictive models that
separate different molecular profiles. The data establish the
use and effectiveness of these approaches for future
prospective studies.

INTRODUCTION 

The emerging gene microarray technologies provide
powerful new methodologies with which to address several
important issues in breast cancer research. For example, it should be
possible to define gene expression patterns that can identify
specific phenotypes (diagnosis), establish a patient's expected
clinical outcome (prognosis), and indicate the likelihood of a
beneficial effect of a specific therapy (prediction; Refs. 1 and 2).
Gene microarray technologies are performed on chips, glass
slides, or filters and allow the comparison of gene expression
profiles from two or more tissues or the same tissue in different
biological states (3). The technologies continue to develop, with
considerable discussion regarding which technology has the
greatest potential to address the molecular profiling of tumors.
Each of the major approaches has advantages and
disadvantages, but the most important consideration is the ability of the
technology to address the chosen hypothesis (4). Overall, there
is no compelling evidence of major differences in the accuracy
or reproducibility of the various microarray platforms (4-6).
Studies that directly compare the nylon-based cDNA arrays with
either glass slide cDNA arrays and/or oligonucleotide chips
consistently report that these platforms produce comparable data
(5-8).

Because gene expression technologies provide an
assessment of mRNA abundance in a sample, all require the
production of a probe, labeled with either a radioactive
nucleotide or fluorescent molecule, generated from either the
total or polyadenylated RNA isolated from the sample.
Currently, it is not possible to isolate adequate concentrations of
high-quality RNA from what would otherwise be the most
abundant source: the formalin-fixed, paraffin-embedded
tumor specimens available in established tumor banks. Only
fresh or appropriately frozen tissues provide the necessary
quality of RNA for the preparation of probes to hybridize to
existing gene expression microarrays.

Whereas many institutions have frozen tumor banks, these
may be of limited use in obtaining reproducible gene expression
profiles for some breast cancers. For example, most are heavily
biased toward large breast tumors (T3-T4). These tumors are
poorly representative of the small tumors now seen in many
patients for initial diagnosis (9). A further concern with existing
frozen tissue banks is the frequent lack of a standardized
approach for tissue acquisition and processing. Tissue handling
between excision and freezing can vary considerably. For
example, some tumors are frozen within seconds of excision, and
others are placed on wet or dry ice after excision, whereas some
may stand for many minutes at room temperature before being
placed in liquid nitrogen. The importance of tissue processing is
often critical for assessing various end points and can affect both
RNA stability for RNA in situ hybridizations and antigen
stability/accessibility for immunohistochemistry (10).

The effect of tissue acquisition and processing on gene
microarray data has not been widely addressed. Nonetheless,
this is likely to be important for at least two critical parameters.
First is preservation of high-quality RNA. Most investigators
acknowledge the importance of using only pure, high-quality
RNA for gene microarray studies (11). The second factor is
maintenance of a tissue’s gene expression profile. For example,
hypoxia- or stress-induced responses can be induced in
metabolically active cells. Oxygen deprivation begins with the loss of
tissue perfusion occurring upon excision. This deprivation can
trigger a hypoxic response, characterized by the altered
expression of specific genes (12, 13). Several of these genes are
transcription factors that further affect the expression of their
target genes (13).

One problem with these two factors is that both can affect a sample,
yet RNA could still be obtained, a probe could still be generated,
and a molecular profile could still be obtained after hybridization
to a gene expression microarray. Subtle changes that are time,
temperature, pH, and/or oxygen dependent could occur with sufficient
variability that they are almost impossible to detect reproducibly.
Some tumors with high metabolic activity may be more sensitive to
hypoxia, producing a statistically valid and biologically plausible
clustering that could have resulted more from tissue processing than
from tissue biology. Where such changes are subtle, expression
profiles might still appear grossly similar, complicating an
assessment of tissue processing artifacts.

Table 1  Experimental conditions for xenograft study

All samples were processed in duplicate.

Temperature of 72 h storage    Wash solution          Wash time
4°C                            1:6 (RNALater:PBS)     5 min
4°C                            1:9 (RNALater:PBS)     5 min
4°C                            1:12 (RNALater:PBS)    5 min
4°C                            PBS                    5 min
4°C                            No wash                No wash
4°C                            1:6 (RNALater:PBS)     120 min
4°C                            1:9 (RNALater:PBS)     120 min
4°C                            1:12 (RNALater:PBS)    120 min
4°C                            PBS                    120 min
Room temperature               No wash                No wash
-20°C                          No wash                No wash

Given the bias of existing banks and the potential differences in
tissue processing, many important questions in breast cancer biology
may require prospective study designs. Such study designs are more
valid for the exploration or validation of new predictive and
prognostic factors. Whereas optimized tissue acquisition and
processing strategies for prospective studies offer the opportunity
for greater control of tissue quality than retrospective studies,
these strategies have not been described. In this study, we wished to
develop a standard tissue acquisition/processing method for
prospective core needle breast biopsy sampling. This method should
avoid the initial use of liquid nitrogen, preserve tissue
architecture, and provide adequate concentrations of high-quality RNA
for microarray analysis. We now report a simple tissue processing
approach using a commercially available reagent (RNALater) that is
applicable to prospective studies on core needle biopsies. RNA
obtained from this approach was compared with RNA from snap-frozen
human breast biopsies of neoplastic and nonneoplastic tissues, gene
expression microarray data were obtained, and an accurate neural
network capable of discriminating between these tissues was built and
validated in an independent data set.

MATERIALS  AND  METHODS 

Breast Cancer Xenograft Studies. MDA-MB-231 cells were inoculated
into athymic nude mice as described previously (14, 15). Mice were
sacrificed, and tumor tissue was obtained using sterile scissors and
forceps. Needle biopsies were taken from the excised xenografts and
placed into separate tubes containing 0.5 ml of RNALater (Ambion,
Austin, TX) at room temperature. Samples were stored at various
temperatures for 72 h and subsequently processed according to the
scheme in Table 1. Each experimental condition was explored in
duplicate samples. Tissues were embedded in OCT (BDH; Poole, Dorset,
United Kingdom), and standard frozen sections were prepared from each
sample. Subsequently, sections were stained with H&E and evaluated by
the study pathologist. The remainder of the core was stored at -80°C,
and total RNA was extracted for evaluation. All animal studies were
performed under protocols approved by the Georgetown University
Animal Care and Use Committee.

Patient Population. Patients undergoing a diagnostic core needle or
excisional biopsy at Georgetown University Hospital were eligible for
the tissue acquisition protocol, in which additional cores were
obtained for study purposes. All patients signed a written consent
form approved by the Georgetown University Medical Center
Institutional Review Board. Core biopsies provided by the
radiologists were obtained with either mammographic or ultrasound
guidance. Core biopsies obtained by the surgeons were obtained either
after surgical exposure of the tumor or during a routine needle
biopsy. A total of 1-4 cores were obtained from each patient for
study purposes, depending on the size of the breast lesion. In
addition, nine frozen breast tumor specimens were obtained from the
Department of Oncology, University of Edinburgh (Edinburgh, Scotland,
United Kingdom) for use in testing the neural networks for accuracy
in identifying tissues as malignant or nonmalignant. These samples
were collected after appropriate patient consent and consistent with
the relevant United Kingdom legislation. In this study, the
pathologist was blinded to all clinical information on all samples.

Collection and Handling of Human Breast Core Biopsies for Microarray
Analysis. Generally, 1-4 core needle biopsies (14-gauge needle) were
obtained from each consenting patient. Random cores were immediately
snap-frozen in liquid nitrogen; others were individually placed in
separate cryo-tubes containing 0.5 ml of RNALater solution.
Snap-frozen tissues were placed directly in liquid nitrogen from the
core biopsy needle, immediately upon removal from the patient. For
the RNALater samples, core biopsies were placed in 500 µl of RNALater
and maintained at 4°C for 24 h before snap-freezing. Each tube was
labeled with the patient's name, hospital number, and study number.
Frozen samples were transferred to the Lombardi Cancer Center's
Tissue and Histopathology Shared Resource (Washington, DC) for
processing.

Before removing the samples from the tube for frozen section
preparation, each sample was washed for 5 min on ice with 500 µl of
ice-cold sterile PBS (RNase free); otherwise, samples in RNALater
will not freeze in the cryostat. Each core biopsy sample was then
embedded separately in an OCT block. A frozen section was taken,
stained with H&E, and examined by the study pathologist.
OCT-embedded samples were maintained frozen at -70°C until the
analysis of the main tumor mass was complete.

The study pathologist evaluated all biopsies to determine the
presence of invasive cancer and to estimate the relative amounts of
normal epithelium, stroma, and fat. Because samples were to be used
for microarray analysis, the percentage of invasive cancer, normal
epithelium, stroma, and fat was estimated relative to cell nuclei
only. Provided this histological review offered no new clinical
information important for patient care, biopsies suitable for
microarray were identified. In this manner, tissue for expression
microarray analysis was ensured to be of no new diagnostic relevance.
This determination is important because RNA extraction destroys
tissue architecture. If the samples had contained information that
modified the surgical pathology diagnosis, these biopsies would not
have been used. This situation did not occur in this study.


Once released for study, all patient identifiers were removed from
each sample. The link between patient identifiers and study
identifiers was held in a confidential database. Access to this
database was reserved only for the clinical study principal
investigator and the data entry technician. The frozen clinical
material, mostly frozen in OCT, was directly provided to the research
laboratory for storage and/or processing. Upon receipt in the
research laboratory, tissue was either stored at -80°C or processed
immediately for RNA extraction.

Preparation and Quality Assessment of RNA from Frozen Tissues. Frozen
tissue was placed in a 1 × 1-inch plastic bag on dry ice and
pulverized, and lysis buffer from the Qiagen RNeasy kit was added
(Qiagen, Inc., Valencia, CA). Each sample was then transferred to a
1.5-ml centrifuge tube, homogenized with a 1-ml syringe and an
18-gauge needle, added to the Qiagen spin column, and centrifuged to
bind the RNA to the matrix. The column was washed with the buffers
provided in the kit, and the RNA was finally eluted with distilled
H2O. RNA concentrations were determined by comparing the absorbance
ratios (A260 nm/A280 nm) obtained spectrophotometrically using a
Beckman DU 640 Spectrophotometer (Beckman, Fullerton, CA).

Because  using  standard  gel  electrophoresis  to  assess  RNA 
quality  would  require  almost  the  entire  RNA  sample,  we  used  an 
Agilent  2100  analyzer  and  RNA  6000  LabChip  kits  (RNA 
microelectroseparation  and  analysis;  Agilent  Technologies, 
New  Castle,  DE).  A  total  of  100  ng  of  each  RNA  sample  was 
loaded/well.  The  analyzer  allows  for  visual  examination  of  both 
the  18S  and  28S  rRNA  bands  as  a  measure  of  RNA  integrity. 

Probe  Generation  for  Gene  Microarray  Hybridizations. 
Probes  were  generated  as  described  previously  (16).  This 
method  radiolabels  both  the  sense  and  antisense  probe  strands 
and  further  increases  probe-specific  activity  by  incorporating 
two  radiolabeled  nucleotides.  Thus,  tumors  can  be  arrayed  on 
nylon  filter  arrays  with  as  little  as  100  ng  of  total  RNA  and 
without  RNA  amplification  (7, 16).  Whereas  an  adequate  signal 
is  generated  with  100  ng  of  total  RNA,  the  use  of  very  low  RNA 
concentrations  will  likely  affect  the  ability  to  adequately  and 
reproducibly  detect  many  lower  abundance  mRNAs.  We  used 
500  ng  of  total  RNA,  which  is  sufficient  to  allow  the  use  of 
approximately  70%  of  breast  needle  biopsies  without  either 
RNA  amplification  or  pooling.  None  of  the  RNAs  was  amplified 
or  pooled  in  the  current  study. 

To synthesize the labeled cDNA probe, 500 ng of total RNA were
incubated at 70°C for 10 min with 2 mg of oligodeoxythymidylate and
then chilled on ice for 2 min. The primed DNA was incubated at 37°C
for 90 min in a solution containing 1X first strand, 3 mM DTT, 1 mM
dGTP/dTTP, 300 units of reverse transcriptase, 50 mCi of [33P]dCTP,
and 50 mCi of [33P]dATP. The second strand was synthesized by adding
1X reaction buffer, 100 units of DNA polymerase I, 500 ng of random
primers, 1 mM dGTP/dTTP, 50 mCi of [33P]dCTP, and 50 mCi of
[33P]dATP. The reaction was incubated for 2 h at 16°C. A radiolabeled
probe was purified using a BioSpin-6 chromatography column (Bio-Rad)
and denatured by boiling for 3 min. A purified probe was added to the
hybridization roller tube containing the prehybridized GeneFilter and
incubated for 12-18 h at 42°C in a Robin Scientific Roller Oven. For
these studies, the NamedGenes filters (Research Genetics, Inc.,
Huntsville, AL) were used. These filters contain 4032 known genes,
192 housekeeping genes, and 192 control genes on each filter. Each
hybridized GeneFilter was washed twice in 2X SSC, 1% SDS at 50°C for
20 min and once at 55°C in 0.5X SSC, 1% SDS for 15 min. Hybridization
signals were detected by phosphorimaging using a Molecular Dynamics
Storm PhosphorImager (Molecular Dynamics, Sunnyvale, CA). The
sensitivity and reproducibility of these and other nylon filter-based
cDNA microarrays have been widely reported (7, 17-20).

Fig. 1  The quality of RNA recovered from human core breast needle
biopsies. RNA was evaluated using an Agilent 2100 analyzer. A,
xenografts; B, breast core needle biopsies arrayed in Figs. 5 and 6
and further characterized in Table 4. In the images displayed,
fluorescence scales are not equivalent between lanes but have been
normalized for clarity.

Fig. 2  MDA-MB-231 human breast tumor xenografts processed for frozen
tissue sectioning in RNALater. A, no wash; B, washed in PBS:RNALater
(1:6; v/v) for 2 h at 4°C; C, washed in PBS:RNALater (1:6; v/v) for
5 min at 4°C; D, washed in PBS for 5 min on ice.


Normalization of Data. Pathways software algorithms (Research
Genetics, Inc.) were used to correct for nonspecific binding of the
probe to filter (background correction). Approaches for signal
normalization, intended to correct for differences in probe specific
activities, hybridizations, and other interexperiment variables, are
diverse (11). In the present study, the average of all data points
was used to calculate a normalization factor; the normalized
intensity value for each spot was obtained by multiplying the
normalization factor by the raw intensity (11).
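
The mean-based normalization described above can be sketched as
follows. This is a minimal illustration and not the Pathways
implementation: the function name normalize_filter, the target_mean
parameter, and the definition of the factor as the target mean divided
by the filter's own mean intensity are assumptions, because the text
states only that the factor is derived from the average of all data
points.

```python
from statistics import mean


def normalize_filter(raw_intensities, target_mean=1.0):
    """Scale background-corrected spot intensities by one global factor.

    Assumption: factor = target_mean / mean(all spots on the filter).
    """
    factor = target_mean / mean(raw_intensities)
    return [factor * x for x in raw_intensities]


# Two hypothetical filters carrying the same relative signal, one of
# them hybridized with a probe of twice the specific activity.
filter_a = [100.0, 200.0, 300.0]
filter_b = [200.0, 400.0, 600.0]

norm_a = normalize_filter(filter_a)
norm_b = normalize_filter(filter_b)
# After scaling, both filters have the same mean and are comparable.
```

A single global factor of this kind corrects for overall differences
in probe activity or exposure between filters, but not for
spot-specific or nonlinear effects.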
Fig. 3  Human breast needle biopsies processed for frozen tissue
sectioning in RNALater. A, no wash; B, washed in PBS:RNALater (1:6;
v/v) for 2 h at 4°C; C, washed in PBS:RNALater (1:6; v/v) for 5 min
at 4°C; D, washed in PBS for 5 min on ice.

Fig. 4  Optimized tissue acquisition/processing procedure for breast
needle biopsies. Flow chart steps: RNALater at 4°C; freeze in liquid
nitrogen and store at -80°C; wash with ice-cold sterile PBS on ice
for 5 min; embed in OCT and cut frozen section; pathology review.
Uninformative cores are released to the study (stored at -80°C or
processed by removal from OCT with RNALater and RNA extraction);
informative cores are retained.


Analysis of Gene Microarray Data. The optimal approach for analyzing
the high dimensional gene expression data generated by gene
microarray studies remains unclear. The high dimensionality of these
data is problematic, with most existing analyses functioning more
accurately in low dimensionality (21). However, rather than making
statistical inferences to identify and study functionally relevant
genes, the goal of this study was to validate the tissue acquisition
and processing methods and demonstrate the applicability of this
approach for building clinically relevant predictive models.

Recently, we devised a simple approach to the exploration of small
studies with two experimental groups.⁵ Our approach used simple
statistical analyses to reduce data dimensionality and identify
subsets of discriminant genes. This approach is similar in principle
to that used by Hedenfalk et al. (22). Because the class of each
sample (cancer versus noncancer) is known from the histopathological
analyses, dimensionality can be reduced in a supervised manner by
performing a series of statistical tests. The major purpose of
performing these tests was only to select a group of genes that would
be used for data visualization and analysis. Student's t test and a
t test for unequal variances (each assumes normal distribution of the
data) and a nonparametric (distribution-free) Wilcoxon test were
used. Whereas the inflated type 1 error will overestimate significant
differences, the incidence of false negative estimates should be
smaller. Because the distribution of the data among and within
replicate experiments and for individual genes cannot be determined
(23), both logarithm-transformed and nontransformed data were
compared.

⁵ Z. Gu. Association of interferon regulatory factor-1,
nucleophosmin, nuclear factor-kappa-B and cAMP binding with acquired
resistance to Faslodex (ICI 182,780), submitted for publication.

Table 2  Concentration of RNA recovered from breast needle biopsies

Biopsies              RNA ≥ 100 ng (a)   Total RNA recovered (mean ± SE)     Range (>100 ng)
n = 55 (b)            51/55 (93%)        3.63 ± 0.48 µg; median = 1.34 µg    100 ng to 12.60 µg
n = 25 Snap-frozen                       2.04 ± 0.51 µg; median = 1.32 µg    100 ng to 9.00 µg (c)
n = 21 RNALater                          3.49 ± 0.78 µg; median = 2.70 µg    100 ng to 12.60 µg

(a) Number producing ≥100 ng of RNA, the minimum useful concentration
of RNA without amplification, irrespective of the tissue acquisition
and processing method applied.
(b) We did not have complete data on processing for the first 9 of
the 55 samples.
(c) P = 0.13; Mann-Whitney rank-sum test; RNALater versus snap-frozen
tissue.

Table 3  Characteristics of breast needle biopsy material

A.
Biopsy      Source      ER/PR (a)  % Cancer  % Normal  % Fat  % CT  RNA (µg) (b)
15A (c)     Radiology   ND         0%        70%       0%     30%   3.28
S6A (c)     Surgery     ND         0%        0%        100%   0%    2.49
10A         Radiology   ND         0%        5%        45%    50%   2.07
11A         Radiology   ND         0%        20%       40%    40%   1.36
17 (c)      Radiology   ND         2%        0%        90%    8%    6.70
S2A         Surgery     +/+        90%       0%        5%     5%    3.20
S3A (c)     Surgery     +/+        90%       0%        0%     10%   2.70
S10D (c)    Surgery     +/+        80%       0%        0%     20%   6.50
S14B (c)    Surgery                80%       0%        0%     20%   4.20
S18A        Surgery     +/-        90%       0%        0%     10%   1.70

B. RNA recovered from biopsies used in this study (mean ± SE) (d)
No RNALater              RNALater                 Overall RNA recovered
2.21 ± 0.52 µg total RNA 3.83 ± 0.73 µg total RNA 3.10 ± 1.60 µg total RNA

C.
Case   Biopsies   Pathological diagnosis
S2     S2A        Invasive adenocarcinoma
       S2B        Invasive adenocarcinoma
       S2C        Invasive adenocarcinoma
S6     S6A        No cancer
       S6B        No cancer
       S6C        No tissue
S10    S10A       No cancer
       S10B       Possible DCIS
       S10C       No cancer
       S10D       Invasive adenocarcinoma

(a) PR, progesterone receptor; CT, connective tissue; DCIS, ductal
carcinoma in situ; ND, not determined.
(b) Total RNA recovered from each needle biopsy. Five hundred ng of
each RNA population were used to generate the probes hybridized to
obtain the data presented in Fig. 5.
(c) Biopsies processed in RNALater.
(d) P = 0.129; Student's t test; RNALater versus no RNALater.

Two reduced dimensional data sets were selected; one comprising genes
with Ps < 0.05, and one comprising genes with Ps < 0.02. Because of
their marked biological differences, these phenotypes should be
easily separable. Thus, the data were visualized using our Fisher
separability-based multidimensional scaling approach that projects
high dimensional data into three-dimensional data space (24, 25).
Because it has become widely used, visualization using the simple
hierarchical clustering described by Eisen et al. (26) is also
presented.
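
The supervised filtering step described above can be sketched for the
Student's t test alone. This is a minimal illustration under stated
assumptions, not the authors' code: the function names and the toy
intensities are hypothetical, and instead of computing P values the
sketch compares |t| with the standard two-sided critical values for
8 degrees of freedom (five cancer plus five noncancer samples), so it
is valid only for that design.

```python
from math import sqrt
from statistics import mean, variance

# Two-sided critical values of Student's t for 8 degrees of freedom,
# taken from standard t tables (alpha -> critical |t|).
T_CRIT_DF8 = {0.05: 2.306, 0.02: 2.896}


def student_t(group_a, group_b):
    """Pooled-variance (Student's) t statistic for two independent groups."""
    na, nb = len(group_a), len(group_b)
    pooled = ((na - 1) * variance(group_a)
              + (nb - 1) * variance(group_b)) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled * (1 / na + 1 / nb))


def select_genes(expression, alpha):
    """Return the genes whose |t| exceeds the critical value for alpha."""
    cutoff = T_CRIT_DF8[alpha]
    return [gene for gene, (cancer, noncancer) in expression.items()
            if abs(student_t(cancer, noncancer)) > cutoff]


# Hypothetical normalized intensities: gene -> (cancer, noncancer).
expression = {
    "gene_up": ([9.0, 10.0, 11.0, 10.0, 10.0], [5.0, 4.0, 6.0, 5.0, 5.0]),
    "gene_flat": ([5.0, 6.0, 4.0, 5.0, 5.0], [5.0, 4.0, 6.0, 5.0, 5.0]),
}

selected_05 = select_genes(expression, 0.05)
selected_02 = select_genes(expression, 0.02)
```

Because thousands of genes are tested, a cutoff of 0.05 is expected to
admit some genes by chance; the stricter 0.02 cutoff trades a smaller
gene list for fewer such false positives.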

Generation and Testing of a Neural Network. To determine whether the
genes we selected could be used to separate cancer from noncancer
tissues, a neural network was trained using the gene expression
microarray data from five cancer biopsies and five noncancer
biopsies. Neural networks can be considered as parallel computing
systems consisting of many simple processors with many
interconnections. The main advantages of neural networks are that
they can learn complex nonlinear input-output relationships, use
sequential training procedures, and adapt themselves to the data
(27, 28).

The learning process involves updating network architecture and
connection weights so that the predictive model can efficiently
perform a specific classification task. We used a multilayer
perceptron to design a nonlinear neural classifier, using each of the
gene's expression levels in the tissue samples as the input and the
cancer versus noncancer phenotype of each sample as the output.
Consequently, the network output comes to approximate the posterior
Bayesian probabilities of a sample being either cancer or noncancer
given its gene expression profile (27, 29, 30). Three experimental
configurations were tested, with either the top 40, 80, and 103
dimensions (data set selecting the top 103 genes; P < 0.05) or 10,
20, and 30 dimensions (data set selecting the top 30 genes;
P < 0.02). These top genes were selected based on their fold
difference between cancer and noncancer and their respective Ps. Two
prediction models were built, one with 3 hidden nodes and 8 inputs
and one with 5 hidden nodes and 18 inputs. Mean-squared error
estimates were used to explore network performance. The
"leave-one-out" method was used for the initial testing and training
of each neural network (27, 29, 30).
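
The leave-one-out procedure named above can be sketched as follows.
To keep the example short and free of dependencies, a nearest-centroid
rule stands in for the multilayer perceptron; that substitution, the
function names, and the two-gene toy profiles are assumptions made
only for illustration.

```python
from math import dist  # available in Python 3.8 and later


def centroid(rows):
    """Mean expression profile of a list of samples."""
    return [sum(col) / len(col) for col in zip(*rows)]


def nearest_centroid_predict(train_x, train_y, sample):
    """Assign the sample to the class whose training centroid is closest."""
    centroids = {
        label: centroid([x for x, y in zip(train_x, train_y) if y == label])
        for label in set(train_y)
    }
    return min(centroids, key=lambda label: dist(centroids[label], sample))


def leave_one_out_accuracy(xs, ys, predict):
    """Hold out each sample once, train on the rest, score the prediction."""
    correct = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        correct += predict(train_x, train_y, xs[i]) == ys[i]
    return correct / len(xs)


# Hypothetical two-gene profiles: five cancer, then five noncancer biopsies.
xs = [[3.0, 0.5], [2.8, 0.6], [3.2, 0.4], [2.9, 0.5], [3.1, 0.7],
      [1.0, 1.5], [1.2, 1.4], [0.9, 1.6], [1.1, 1.5], [1.0, 1.3]]
ys = ["cancer"] * 5 + ["noncancer"] * 5

accuracy = leave_one_out_accuracy(xs, ys, nearest_centroid_predict)
```

With only ten samples, each held-out prediction is made by a model
fitted to the other nine, which uses the data efficiently but gives a
coarse accuracy estimate (steps of 10%).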

RESULTS 

RNA Quality and Tissue Architecture from Xenograft Tissues Processed
Using RNALater. Recovery of high-quality RNA was optimal when
OCT-embedded tissue samples were removed from frozen blocks using a
small volume of RNALater, which thawed and softened the embedding
medium before tissue extraction from the blocks. Thus, the frozen
block was placed in a small plastic tray, with the embedded tissue
facing up, and 500 µl of RNALater were pipetted on top. Using this
method, seven of eight samples yielded high-quality RNA (Fig. 1A; RNA
integrity analyzed using the Agilent 2100 bioanalyzer and RNA 6000
LabChip kit). In previous experiments, where the OCT block was
dissolved by vigorous shaking in a large volume of PBS and tissue
fragments were recovered with a strainer, only 6 of 12 samples
yielded fully intact RNA (data not shown).

Pathology was not interpretable from material frozen directly in
RNALater and transferred to OCT without wash steps (Fig. 2A). This
reflects inadequate freezing in the cryostat and consequent tissue
folding during the cutting process. When washed in PBS:RNALater (1:6)
for 2 h at 4°C, the tissue did not fold on cutting, but cell outlines
appeared blurred, making pathological interpretation difficult
(Fig. 2B). Washing in PBS:RNALater (1:6) for 5 min at 4°C also
eliminated tissue folding, but now the cell outlines appeared
distinct. Nonetheless, tissue fragmentation occurred in some
specimens, making pathological interpretation suboptimal (Fig. 2C).
Optimal preservation of tissue architecture was obtained by washing
tissue for 5 min on ice with ice-cold PBS. Cell margins were clear,
tissue folding and fragmentation were not observed, and the integrity
of the cores was maintained, allowing optimal pathological
interpretation (Fig. 2D). Similar data were obtained from a human
breast core biopsy released to this study (Fig. 3, A-D). A scheme of
the optimized tissue acquisition protocol is shown in Fig. 4.


Fig. 5  Structure of the gene expression data using the top 103
genes. A, projection from top 103 genes (103 dimensions), selected by
t test comparisons of log10-transformed gene expression data
(P < 0.05) and projected into three dimensions; B, hierarchical
clustering of samples in 2 dimensions based on 1 - Pearson's
coefficient matrix of the top 103 genes. Snap-frozen tissue: ○,
noncancer; △, cancer. RNALater-processed tissue: ●, noncancer; ▲,
cancer.


Recovery of High-quality RNA from Human Breast Biopsies for Gene
Microarray Studies. Cores were removed from OCT by placing the frozen
block in a small plastic tray, with the embedded tissue facing up,
and pipetting 500 µl of RNALater on top. Intact cores were easily
picked out of the OCT, which remained semisolid, using a sterile
pipette tip. From a study of 55 breast needle biopsies, we obtained
>100 ng of RNA on almost all samples (Table 2). The median value
(1.34 µg) shows that most biopsies produce sufficient RNA to generate
data using 500 ng of total RNA. There was no significant difference
between frozen and RNALater-processed biopsies in the mean
concentrations of total RNA recovered (Tables 2 and 3). Thus,
prospectively collected breast needle biopsies, either directly
snap-frozen or processed in RNALater, can produce adequate RNA
concentrations for use in gene microarray studies.

A further requirement of gene expression microarray experiments is
the isolation of high-quality RNA (11). Sufficient RNA was not
recovered to allow for an assessment of RNA quality by both standard
gel electrophoresis methods and gene microarray studies on the same
samples. Because gel electrophoresis requires ~1 µg of RNA, the
Agilent 2100 "lab-on-a-chip" technology was used to assess RNA
quality. This technology requires only 100 ng of RNA to determine
quality, with specificity comparable with or better than that
obtained from standard gel electrophoresis. Consequently, RNA quality
can be assessed on samples that will later be subjected to gene
microarray analysis. As is evident from Fig. 1B, >90% of
representative biopsies produced high-quality RNA.

Table 4  Genes comprising the 30-dimensional data set

A. Genes that appear up-regulated in cancer tissue
                                                      ------------------- P -------------------
Gene                                       C:NC (a)   t test (b)  Unequal  t test log10  Wilcoxon
BTF-2                                      3.2        0.010       0.026    0.001         0.008
p160                                       3.1        0.016       0.030    0.027         0.095
spr2                                       3.0        0.006       0.013    0.007         0.008
Interferon inducible 9-27                  2.9        0.007       0.019    0.003         0.008
Human surface antigen                      2.7        0.018       0.035    0.023         0.056
Grb14                                      2.6        0.004       0.011    0.001         0.008
gp250 precursor                            2.6        0.008       0.009    0.002         0.016
TAK1-binding protein                       2.4        0.014       0.010    0.002         0.008
Myosin-binding protein H                   2.3        0.016       0.022    0.011         0.032
RP3                                        2.3        0.006       0.008    0.004         0.016
α-Catenin                                  2.3        0.001       0.004    0.001         0.008
T3 receptor cofactor-1                     2.3        0.005       0.006    0.015         0.016
DAP-3                                      2.2        0.017       0.021    0.025         0.032
Selenoprotein-W (selW)                     2.1        0.015       0.023    0.018         0.056
Ferroxidase                                2.1        0.007       0.011    0.008         0.016
Cytochrome c                               2.1        0.017       0.017    0.044         0.032
RAN-BP8                                    2.1        0.017       0.018    0.016         0.032
Aspartate aminotransferase-1               1.9        0.010       0.011    0.009         0.032
unc-18 homologue                           1.9        0.007       0.010    0.005         0.008
Phosphoethanolamine cytidylyltransferase   1.9        0.007       0.007    0.001         0.016
Frezzled (fre)                             1.9        0.016       0.024    0.014         0.016
Interferon α-induced 11.5 kDa              1.8        0.017       0.019    0.017         0.032
Ubiquitin activating enzyme E1             1.8        0.015       0.015    0.011         0.032
Macrophage-stimulating 1                   1.8        0.014       0.014    0.022         0.032
Ah receptor                                1.8        0.002       0.004    0.002         0.008

B. Genes that appear up-regulated in noncancer tissues
                                                      ------------------- P -------------------
Gene                                       C:NC (a)   t test (b)  Unequal  t test log10  Wilcoxon
Neurofibromin 2                            0.6        0.005       0.005    0.006         0.008
Frizzled-related protein                   0.5        0.016       0.025    0.015         0.031
Type II keratin                            0.4        0.006       0.007    0.027         0.008
CAGH4                                      0.4        0.004       0.006    0.003         0.008
Dihydroguanosine triphosphatase            0.3        0.014       0.033    0.002         0.008

(a) C:NC, ratio of expression level in cancer versus noncancer. Genes
were selected on the basis of C:NC ≥ 1.8 (approximately 2-fold);
P ≤ 0.02 (estimated to three significant figures) in Student's
t test.
(b) t tests used are: t test, Student's (untransformed data);
Unequal, unequal variance (untransformed data); t test log10,
Student's t test on log10-transformed data; Wilcoxon, Wilcoxon
rank-sum test (nonparametric).
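
With five cancer and five noncancer biopsies, an exact Wilcoxon
rank-sum test can take only a limited set of P values, which may
explain why the Wilcoxon column of Table 4 repeats values such as
0.008, 0.016, and 0.032. The sketch below enumerates all 252 ways of
assigning five of the ten ranks to one group. It assumes no tied
values and a two-sided exact test, neither of which is stated in the
text, and the function name and intensities are hypothetical.

```python
from itertools import combinations


def exact_rank_sum_p(group_a, group_b):
    """Two-sided exact Wilcoxon rank-sum P value (assumes no tied values)."""
    pooled = sorted(group_a + group_b)
    ranks = {value: i + 1 for i, value in enumerate(pooled)}
    n_a, n = len(group_a), len(pooled)
    observed = sum(ranks[v] for v in group_a)
    expected = n_a * (n + 1) / 2  # mean rank sum under the null hypothesis
    # Rank sum of every possible way to choose n_a of the n ranks.
    sums = [sum(c) for c in combinations(range(1, n + 1), n_a)]
    extreme = sum(abs(s - expected) >= abs(observed - expected) for s in sums)
    return extreme / len(sums)


# Complete separation of the two groups: 2 of 252 arrangements.
p_separated = exact_rank_sum_p([6, 7, 8, 9, 10], [1, 2, 3, 4, 5])
# One overlapping value: 4 of 252 arrangements.
p_one_overlap = exact_rank_sum_p([5, 7, 8, 9, 10], [1, 2, 3, 4, 6])
```

Under these assumptions the smallest attainable two-sided P value is
2/252 (about 0.008), so a gene can never reach a smaller Wilcoxon P
value in a five-versus-five comparison, however large the difference
in expression.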

Analysis of Core Needle Breast Biopsies and Visualization of Gene
Expression Data. To assess the applicability of the tissue processing
procedure, we obtained total RNA from five random breast cancer
biopsies and five random biopsies of noncancer tissue (Table 3). All
tissues were evaluated by the study pathologist before release for
our studies to ensure that the investigational cores contained no
diagnostically useful information. Both biopsies processed in
RNALater and biopsies frozen without RNALater were analyzed. These
biopsies were approximately equally represented in each group
(RNALater processed: cancer = 3; noncancer = 3). RNA was prepared,
and probes were generated as described above. The mean RNA
concentrations recovered by both methods were comparable (see also
Table 2). Probes were hybridized to NamedGene filters, and signal was
measured using a Molecular Dynamics Storm PhosphorImager. Digitized
representations of the hybridized filter signals were imported into
the Pathways software for background correction and normalization.

Normalized gene expression data were imported into the
visualization algorithm, and scatter plots of the gene expression
data were generated. We first reduced dimensionality by
eliminating noninformative genes. Hence, we excluded those genes
whose expression was not likely to be different between the
cancer and noncancer groups (multiple t tests, P > 0.05). A total
of 103 genes passed this filter and were used to generate a
three-dimensional (from 103-dimensional) plot of the data (Fig.
5A). The three axes are the first three principal components
fitted to the cancer and noncancer molecular profile data. The
cumulative proportion of the variance captured by each
principal component axis is: (a) principal component axis 1, 55%;
(b) principal component axis 2, 72%; and (c) principal component
axis 3, 79%. We also applied hierarchical clustering, similar to
approaches used by others (26), based on Euclidean space
analysis (1 - Pearson’s correlation coefficient matrix). The latter
approach could not completely separate two cancers from the
clusters of noncancers (Fig. 5B). PCA⁶-based multidimensional
scaling visualization separated breast cancers (triangles) and
noncancer tissue (circles) into linearly separable gene
expression data space. However, it should be noted that neither
approach provides a statistical assessment of separability, only a
visualization of data structure. Whereas the number of data
points is limited, the multidimensional scaling visualization is
consistent with our ability to identify a putative molecular
profile that can separate neoplastic from nonneoplastic tissues.
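
The projection step can be sketched as follows. This is a minimal illustration, not the visualization algorithm used in the study: it assumes NumPy, obtains the principal components from a plain singular value decomposition, and runs on randomly generated stand-in data. The function name `pca_project` and the simulated matrix are hypothetical.

```python
import numpy as np

def pca_project(X, n_axes=3):
    """Project samples (rows of X) onto the first n_axes principal components.

    Returns the projected coordinates and the cumulative proportion of
    variance captured after each retained axis.
    """
    Xc = X - X.mean(axis=0)                 # center each gene (column)
    # Rows of Vt are the principal axes, ordered by decreasing variance.
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    variance = s ** 2                       # proportional to variance per axis
    cumulative = np.cumsum(variance) / variance.sum()
    coords = Xc @ Vt[:n_axes].T
    return coords, cumulative[:n_axes]

# Hypothetical stand-in for a 10-sample x 103-gene matrix (not study data).
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 103))
X[:5] += 1.5                                # shift the first five rows ("cancer")
coords, cumulative = pca_project(X)
```

Because the reported proportions are cumulative, the values of 55%, 72%, and 79% quoted above correspond to roughly 55%, 17%, and 7% of the variance for axes 1, 2, and 3 taken individually.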

This subset of genes is expected to include some false
positives, reflecting the type 1 error associated with the
selection. Consequently, data dimensionality was further reduced
using more conservative criteria (P < 0.02 and regulation
≥1.8-fold). We chose this fold regulation to include all ~2-fold
differences in mean gene expression levels between cancer and
noncancer tissues. The analysis produced a 30-dimensional data
set; 25 signals (genes) were up-regulated in the neoplastic
biopsies (Table 4A), and 5 signals were up-regulated in the
nonneoplastic biopsies (Table 4B). The ability of this subset to
separate cancer from noncancer was also evaluated using both
our PCA-based multidimensional scaling approach and simple
hierarchical clustering. The cumulative proportion of the
variance captured by each principal component axis is: (a) principal
component axis 1, 64%; (b) principal component axis 2, 75%;
and (c) principal component axis 3, 82%. Neoplastic and
nonneoplastic tissues (Table 3) were now linearly separable in gene
expression data space by both visualization methods (Fig. 6, A
and B).
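
The two-criterion selection described above can be sketched as a simple filter. This is an illustration under stated assumptions rather than the procedure used in the study: the P values are taken as already computed elsewhere, the fold difference is assumed to be the ratio of group means of nontransformed signals, and the function name, gene labels, and numbers are all hypothetical.

```python
from statistics import mean

def select_genes(cancer, noncancer, p_values, p_cut=0.02, fold_cut=1.8):
    """Return (up_in_cancer, up_in_noncancer) for genes passing both cuts.

    cancer / noncancer: dict mapping gene -> list of nontransformed signals
    p_values: dict mapping gene -> P value from a test computed elsewhere
    """
    up_cancer, up_noncancer = [], []
    for gene, p in p_values.items():
        if p >= p_cut:
            continue                        # fails the significance cut
        m_c, m_n = mean(cancer[gene]), mean(noncancer[gene])
        if m_c >= fold_cut * m_n:           # at least fold_cut higher in cancer
            up_cancer.append(gene)
        elif m_n >= fold_cut * m_c:         # at least fold_cut higher in noncancer
            up_noncancer.append(gene)
    return up_cancer, up_noncancer

# Hypothetical signals for four genes, five biopsies per group.
cancer = {"gene_a": [10, 12, 11, 9, 13], "gene_b": [5, 5, 6, 4, 5],
          "gene_c": [2, 3, 2, 3, 2], "gene_d": [6, 6, 6, 6, 6]}
noncancer = {"gene_a": [5, 5, 6, 4, 5], "gene_b": [5, 6, 5, 5, 4],
             "gene_c": [6, 5, 6, 5, 6], "gene_d": [4, 4, 4, 4, 4]}
p_values = {"gene_a": 0.001, "gene_b": 0.60, "gene_c": 0.01, "gene_d": 0.01}

up_cancer, up_noncancer = select_genes(cancer, noncancer, p_values)
```

In this toy input, gene_b fails the P value cut and gene_d passes it but shows only a 1.5-fold difference, so neither is retained.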

Neural  Network  Predictors  of  Biopsy  Phenotypes. 

Having  reduced  the  dimensionality,  it  was  necessary  to  assess 
whether  the  expression  patterns  of  remaining  genes  in  the  103 
and  30  dimensions  contained  useful  discriminatory  information. 
Thus,  the  ability  of  various  gene  subsets  to  train  accurate  neural 
network  predictors  that  could  predominantly  separate  cancer 
from  noncancer  tissues  was  assessed.  The  three  configurations 
tested  (1-3  hidden  nodes)  for  genes  within  the  30-  and  103- 
dimensional  data  sets  are  described  in  “Materials  and  Methods.” 
All  were  evaluated  using  the  leave-one-out  method.  Whereas  the 
number  of  microarrays  from  which  the  data  are  obtained  is  small 
(n  =  10),  each  configuration  achieved  a  0%  misclassification 
rate  (network  training)  for  cancer  versus  noncancer,  whether  in 
103 or 30 dimensions and with either log10 or nontransformed
gene  expression  data. 
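
A leave-one-out evaluation of a small network can be sketched as below. This is a hedged illustration only: the actual network configurations are those given in "Materials and Methods", whereas this sketch assumes NumPy, a single hidden layer of tanh units trained by plain gradient descent, per-fold standardization, and simulated data. All function names and hyperparameters are hypothetical.

```python
import numpy as np

def train_mlp(X, y, hidden=2, epochs=500, lr=0.1, seed=0):
    """Train a one-hidden-layer network (tanh units, sigmoid output)."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.1, size=(X.shape[1], hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(scale=0.1, size=hidden)
    b2 = 0.0
    n = len(y)
    for _ in range(epochs):
        h = np.tanh(X @ W1 + b1)
        logit = np.clip(h @ W2 + b2, -30, 30)
        p = 1.0 / (1.0 + np.exp(-logit))
        err = p - y                          # cross-entropy gradient w.r.t. logit
        dz = np.outer(err, W2) * (1.0 - h ** 2)
        W2 -= lr * (h.T @ err) / n
        b2 -= lr * err.mean()
        W1 -= lr * (X.T @ dz) / n
        b1 -= lr * dz.mean(axis=0)
    return W1, b1, W2, b2

def predict(params, X):
    """Return 1 (cancer) or 0 (noncancer) for each row of X."""
    W1, b1, W2, b2 = params
    return (np.tanh(X @ W1 + b1) @ W2 + b2 > 0).astype(int)

def leave_one_out_error(X, y, hidden=2):
    """Hold out each sample in turn; return the misclassification rate."""
    errors = 0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        mu = X[keep].mean(axis=0)
        sd = X[keep].std(axis=0) + 1e-9      # scale using training rows only
        params = train_mlp((X[keep] - mu) / sd, y[keep], hidden)
        guess = predict(params, ((X[i] - mu) / sd)[None, :])[0]
        errors += int(guess != y[i])
    return errors / len(y)

# Simulated stand-in for the 10-biopsy, 30-gene data set (not study data).
rng = np.random.default_rng(1)
X = rng.normal(size=(10, 30))
X[:5] += 3.0
y = np.array([1] * 5 + [0] * 5)
rate = leave_one_out_error(X, y, hidden=2)
```

Standardizing inside each fold keeps the held-out sample from influencing the scaling, which is one way to avoid leaking information from the test case into training.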

Because  the  initial  training  and  testing  were  done  on  the 
original  data  set  from  the  Georgetown  University  samples,  we 
tested  the  neural  networks  against  an  independent  data  set  of 
nine  frozen  breast  specimens  from  the  University  of  Edinburgh. 
These  were  snap-frozen  mastectomy  specimens  rather  than  core 
needle  breast  biopsies,  but  they  should  contain  a  mixture  of 
cancer and noncancer cells and provide a strong and independent
challenge for the neural network. The neural network model
should accurately predict as cancer any biopsy comprising
>80% cancer tissue.

6 The abbreviations used are: PCA, principal component analysis; ER,
estrogen receptor.

Fig. 6 Structure of the gene expression data using the top 30 genes. A,
projection from top 30 genes (30 dimensions), selected by t test
comparisons of log10-transformed gene expression data (P ≤ 0.02;
≥1.8-fold difference) and projected into three dimensions; B, hierarchical
clustering of samples in 2 dimensions based on 1 - Pearson’s coefficient
matrix of the top 30 genes. Snap-frozen tissue: ○, noncancer; △, cancer;
RNAlater-processed tissue: ●, noncancer; ▲, cancer.

Gene expression data were generated using the same
Research Genetics filter technology and queried in the predictive
model. For both the 103 and 30 gene data sets, the
nontransformed data provided the more accurate models. Both models
predicted that all nine samples should be cancer and not
noncancer. The pathologist who evaluated the samples for the
training set subsequently performed histopathological analysis
of stored samples of these tissues. All nine samples were
confirmed as ≥80% cancer specimens. Thus, no samples in the
independent test data set were misclassified, demonstrating the
neural network’s predictive accuracy. When the log10 data were
used, the models misclassified 1 of 9 tumors (30 dimensions;
89% accurate) and 2 of 9 tumors (103 dimensions; 78%
accurate). The lower classification rate with the 103 genes probably
reflects the increased type 1 error associated with this data set
and the failure to exclude some uninformative genes.
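
The reported percentages follow directly from the misclassification counts, as the short sketch below shows. The counts are those given in the text; the function and variable names are illustrative. Because all nine independent specimens were cancers, these figures measure how often cancers were called cancer (sensitivity); this test set cannot estimate how often noncancer tissue would be called noncancer.

```python
def percent_correct(n_correct, n_total):
    """Percentage of correctly classified samples, or None if n_total is 0."""
    if n_total == 0:
        return None
    return round(100 * n_correct / n_total)

# (data transformation, number of genes) -> tumors misclassified out of 9
misclassified = {
    ("nontransformed", 30): 0,
    ("nontransformed", 103): 0,
    ("log10", 30): 1,
    ("log10", 103): 2,
}
n_cancer, n_noncancer = 9, 0    # the independent set contained only cancers

sensitivity = {model: percent_correct(n_cancer - missed, n_cancer)
               for model, missed in misclassified.items()}
specificity = percent_correct(0, n_noncancer)   # None: nothing to evaluate
```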

Table 5 Function of selected genes

| Gene name(s) | UniGene no.ᵃ | Function | Ref. no. |
|---|---|---|---|
| BTF2 (butyrophilin) | Hs.167741 | Glycoprotein component of human milk fat globule membranes; membrane-associated receptor for association of cytoplasmic droplets with the apical plasma membrane | 35 and 41 |
| grb14 (growth factor receptor-bound protein 14) | Hs.83070 | Member of the grb7 family; phosphorylated by a PDGF-regulated serine kinase; expression correlates with ER expression | 37 |
| TAB1; TAK1 (MAPKKK; TGFβ-activated kinase-1) binding protein-1 | Hs.31472 | Stimulates NFκB activation; implicated in signaling in response to TGF-β and TNF-α; activates plasminogen activator inhibitor 1 | 42 and 43 |
| α-Catenin | Hs.178452 | Cell adhesion molecule; binds E-cadherin; associated with tumor grade and ER expression | 38 and 44 |
| DAP3 (death-associated protein-3) | Hs.159627 | Proapoptotic, nucleotide-binding protein | 45 and 46 |
| Ceruloplasmin; ferroxidase | Hs.28896 | Copper transport protein; present in breast milk; serum levels are elevated in breast cancer patients with progressive disease but not in patients in remission or those with benign breast lesions | 33, 34, and 47 |
| Ah receptor (aryl hydrocarbon receptor) | Hs.172287 | Binds environmental toxins; interacts with ER; can block the transcriptional activity of ER; binds ER co-regulators ERAP140 and SMRT | 36, 48, and 49 |
| NF2 (neurofibromatosis-2) | Hs.902 | Tumor suppressor; rarely mutated in breast cancer | 50 and 51 |
| Frizzled-related gene | Hs.7306 | Secreted protein; lost in ~80% of breast cancers; apoptosis-related gene, induced by Adriamycin | 52 and 53 |

ᵃ The UniGene databases can be found at http://www.ncbi.nlm.nih.gov/UniGene/Hs.Home.html.

ᵇ PDGF, platelet-derived growth factor; NFκB, nuclear factor κB; TGF, transforming growth factor; TNF, tumor necrosis factor.

Genes Differentially Expressed between Breast Cancer
and Noncancer Tissues. The data in Table 4 show that the
choice of t test has only a marginal effect on data selection for
supervised dimension reduction. If we make no assumption
regarding distribution of the data, approximately 1 in 3 genes
would be rejected by relying solely on the nonparametric
analyses, a >1.8-fold differential expression, and a cutoff of P <
0.02. The 30 target cDNAs comprising the 30-dimensional data
set are presented in Table 4.
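
One way to see why a distribution-free criterion can be stricter with so few samples is to enumerate an exact rank test for two groups of five. The sketch below assumes the nonparametric analysis is a two-sided rank-sum (Mann-Whitney) comparison without ties; the source does not specify this here, so treat it as an assumption. The function name is hypothetical.

```python
from itertools import combinations

def exact_rank_sum_p(u_observed, n_a=5, n_b=5):
    """Exact two-sided P for a Mann-Whitney U statistic, assuming no ties.

    Enumerates every way of assigning n_a of the n_a + n_b ranks to
    group A and counts assignments at least as extreme as u_observed.
    """
    n = n_a + n_b
    u_max = n_a * n_b
    u_low = min(u_observed, u_max - u_observed)
    extreme = total = 0
    for ranks_a in combinations(range(1, n + 1), n_a):
        u = sum(ranks_a) - n_a * (n_a + 1) // 2   # U for group A
        total += 1
        if min(u, u_max - u) <= u_low:
            extreme += 1
    return extreme / total

p_complete_separation = exact_rank_sum_p(0)   # no overlap between groups
p_one_inversion = exact_rank_sum_p(1)         # a single pair out of order
p_two_inversions = exact_rank_sum_p(2)
```

Under these assumptions there are 252 possible rank assignments, so a gene reaches P < 0.02 only when the two groups are completely separated (2/252) or differ by a single rank inversion (4/252); two inversions already give 8/252, which exceeds the cutoff.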

DISCUSSION 

Generally,  prospective  study  designs  are  more  valid  for  the 
exploration or validation of new predictive and prognostic
factors. Retrospective breast cancer studies may be compromised
by  the  bias  toward  larger  tumors  in  many  existing  frozen  tumor 
banks,  whereas  the  average  size  of  most  newly  diagnosed  breast 
tumors  continues  to  decrease  (9).  Thus,  many  studies  into  the 
molecular  biology  of  such  early  lesions  may  need  to  be  done 
prospectively.  Investigators  at  single  academic  institutions  can 
often  prospectively  obtain  frozen  samples  under  a  rigorous 
collection  protocol.  However,  the  ability  to  do  so  at  multiple 
institutions  or  when  local  clinics  and  community  physicians  are 
also  involved  can  be  problematic.  A  rapid,  standard  tissue 
processing  approach  should  allow  for  the  use  of  tissues  from 
multiple  institutions  in  a  controlled  manner.  For  example,  it 
should  be  possible  to  reduce  possible  changes  in  molecular 
profiles  associated  with  differences  in  tissue  acquisition  and 
processing.  Whereas  these  concerns  have  not  been  explored  in 
detail experimentally, tissue processing clearly affects the
performance of other molecular biological technologies applied to
human  biopsies  and  tumor  tissues  (10). 

To address these issues, we conducted studies to identify
an optimal tissue acquisition, processing, and analysis
procedure for exploring the gene expression profiles of
prospectively accrued breast core needle biopsies. Because RNA
extraction destroys tissue architecture, we developed a novel
method for tissue processing that would allow us to obtain
samples in a uniform manner, preserve RNA quality/quantity,
and, most importantly, retain all potentially diagnostically
relevant information.

Tissue placed in RNAlater can be held for up to 1 h at 37°C,
1 week at 25°C, and >1 month at 4°C and
retain fully intact RNA (31). Our data show that biopsies
processed immediately in either liquid nitrogen or RNAlater can
produce sufficient concentrations of high-quality RNA for nylon
filter microarray analysis without RNA amplification. This
amount of RNA is also adequate for amplification for use with
other gene expression microarray technologies (32). If
processed carefully, tissue architecture can be maintained from
biopsies collected in RNAlater. This is clearly important
because some small breast lesions can be completely removed by
the biopsy procedure. These core biopsies should not be used for
studies if critical diagnostic information could be lost. We
estimate that, using the approaches described in this study,
approximately 90% of suitable core needle breast biopsies
should produce sufficient material for gene expression
microarray studies.

Our  studies  demonstrate  that  the  RNA  recovered  can  be 
used to generate relevant gene expression microarray
information. Relevance is evident from our abilities to identify
differentially expressed genes associated with breast cancer cells and
to  build  accurate  neural  network  predictors  that  can  identify 
cancer  from  noncancer  samples  based  solely  on  their  gene 
expression  profiles. 

Among the differentially expressed genes in the reduced
30-dimensional space, we would expect to find some genes
either already implicated in breast cancer or known to be
expressed in normal or neoplastic breast tissues. Consistent with
this expectation, several genes of potential relevance were
identified. For example, ceruloplasmin is up-regulated in neoplastic
breast tissues, and elevated serum levels of ceruloplasmin are
associated with recurrent breast cancers (33, 34). The BT2
glycoprotein is a milk protein (35) and might be expected to be
expressed in tissues predominantly composed of breast
epithelial cells.

ER  protein  status  is  determined  routinely  for  cancer  but  not 
noncancer  biopsies.  Because  four  of  the  five  tumor  biopsies 
were  ER  positive,  we  also  would  expect  to  find  genes  with 
expression  patterns  known  either  to  be  associated  with  ER  or  to 
modulate  ER  function.  At  least  three  genes  meet  these  criteria. 
The  aryl  hydrocarbon  receptor  is  known  to  interact  with  ER  and 
affect its function (36), and the expression of both grb14 and
α-catenin is associated with ER expression in breast tumors
(Refs. 37 and 38; Table 5).

The  discriminant  power  of  the  genes  selected  is  evident 
from  the  accuracy  of  the  neural  networks  built  using  the  data 
from  the  initial  five  cancer  and  five  noncancer  biopsies.  The 
ability  to  accurately  identify  independent  samples  as  cancer 
shows  that  the  genes  of  interest  are  expressed  or  repressed  in 
both  patterns  and  at  levels  consistent  with  the  model.  This  is  an 
appropriate  and  rigorous  test  of  the  approach  because  the  goal 
was to build molecular predictors, rather than to identify
functionally relevant genes. Building a predictor is also a much more
efficient test of the selected genes than would be obtained by
simply confirming expression gene by gene in more standard
assays: Northern blot, RNase protection, or real-time PCR.
Confirming the differential expression of each gene is
unnecessary for building clinically relevant predictive models. Unlike
in studies to identify functionally relevant genes, the discriminant
power of each signal from the target cDNAs on the array is
independent of whether that signal originates from hybridization
to its expected mRNA.

The gene expression profile data and neural network
performance suggest that, at least for samples of very different
biologies, contamination of samples with >80% of other cell
types may not confound analyses for molecular profiling.
Whether this observation can be extrapolated to other studies
remains to be further established. Nonetheless, the
resource-intensive requirements of microdissection and RNA
amplification may not be absolute requirements for all molecular
profiling studies.

The  tissue  acquisition  and  processing  methods,  dimension 
reduction,  data  visualization  approaches,  and  neural  network 
analyses  we  describe  may  be  useful  in  the  design  of  larger 
prospective studies. We continue to develop other data
visualization, normalization, and exploration algorithms that also may
be  of  use  in  the  analysis  of  gene  expression  microarray  studies 
(24,  25,  39,  40). 